I am new to crawling web pages with Scrapy and unfortunately chose a dynamic one to start with...
I've successfully crawled part of the links (120), thanks to someone helping me here, but not all the links on the target website.
After doing some research, I know that crawling an AJAX page boils down to a few simple steps:
• open the browser developer tools, Network tab
• go to the target site
• click the submit button and see what XHR request goes to the server
• simulate that XHR request in your spider
The last step still sounds obscure to me, though: how exactly do you simulate an XHR request?
I've seen people use 'headers', 'formdata', and other parameters to do it, but I can't figure out what they mean.
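For concreteness, here is a minimal sketch of what simulating an XHR request looks like in Scrapy. Everything in it is a placeholder (the URL, the header, and the form fields are hypothetical, not the real values Google Play expects); the point is only the pattern: copy the URL, form fields, and any required headers from the request you saw in the Network tab into a FormRequest.

import scrapy
from scrapy.http import FormRequest

class XhrExampleSpider(scrapy.Spider):
    # Hypothetical spider that only demonstrates the FormRequest pattern.
    name = "xhr_example"

    def start_requests(self):
        # Reproduce the POST you observed in the browser's Network tab.
        yield FormRequest(
            url="https://example.com/ajax/endpoint",   # placeholder URL
            method="POST",
            # 'headers' mirrors what the browser sent, if the server checks it.
            headers={"X-Requested-With": "XMLHttpRequest"},
            # 'formdata' mirrors the form fields of the request payload.
            formdata={"start": "0", "num": "60"},
            callback=self.parse,
        )

    def parse(self, response):
        # The response is whatever the server returns for the XHR,
        # typically an HTML fragment or JSON.
        self.log(response.text[:200])

The server answers such a request the same way it answers the browser's XHR, so the response body is the same content you saw in the developer tools.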
Here is part of my code:
import re
import scrapy
from scrapy.http import FormRequest

class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def start_request(self, response):
        for i in range(0, 10):
            yield FormRequest(url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                              method="POST",
                              formdata={'start': str(i + 60), 'num': '60', 'numChildren': '0',
                                        'ipf': '1', 'xhr': '1',
                                        'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                              callback=self.parse)

    def parse(self, response):
        links = response.xpath("//a/@href").extract()
        crawledLinks = []
        LinkPattern = re.compile(r"^/store/apps/details\?id=.")
        for link in links:
            if LinkPattern.match(link) and link not in crawledLinks:
                crawledLinks.append("http://play.google.com" + link + "#release")
        for link in crawledLinks:
            yield scrapy.Request(link, callback=self.parse_every_app)

    def parse_every_app(self, response):
        # ...
The start_request method seems to play no role here: if I delete it, the spider still crawls the same number of links.
I've been working on this problem for a week... I'd highly appreciate it if you could help me out...
Try this:
import scrapy
from scrapy.http import FormRequest
# googleAppItem is the asker's Item class; import it from your project's items module.

class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def parse(self, response):
        for i in range(0, 10):
            yield FormRequest(url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                              method="POST",
                              formdata={'start': str(i * 60), 'num': '60', 'numChildren': '0',
                                        'ipf': '1', 'xhr': '1',
                                        'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                              callback=self.data_parse)

    def data_parse(self, response):
        seen = set()
        links = response.xpath("//a/@href").re(r'/store/apps/details.*')
        for l in links:
            if l not in seen:
                seen.add(l)
                item = googleAppItem()   # build a fresh item per unique link
                item['url'] = l
                yield item
Run the spider with scrapy crawl googleApp -o links.csv
or scrapy crawl googleApp -o links.json
and you'll get all the links in a CSV or JSON file. To increase the number of pages crawled, change the range of the for loop.
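As a side note on the original question: the reason start_request plays no role is almost certainly its name. The Scrapy hook is called start_requests, with an s, and it takes no response argument; a method named start_request is simply never invoked, so the spider falls back to start_urls. Here is a minimal sketch of the correctly named override, reusing the question's form fields (the spider name is hypothetical, and the token is session-bound, so it will likely need to be refreshed from the browser):

import scrapy
from scrapy.http import FormRequest

class GoogleAppStartRequestsSpider(scrapy.Spider):
    name = "googleAppStartRequests"   # hypothetical name to avoid clashing with the original
    allowed_domains = ['play.google.com']

    # Note the exact name and signature: Scrapy calls start_requests(self)
    # itself; it receives no response and replaces start_urls entirely.
    def start_requests(self):
        for i in range(0, 10):
            yield FormRequest(url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                              method="POST",
                              formdata={'start': str(i * 60), 'num': '60', 'numChildren': '0',
                                        'ipf': '1', 'xhr': '1',
                                        # Session token copied from the browser; it likely expires.
                                        'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                              callback=self.parse)

    def parse(self, response):
        self.log("got %d bytes" % len(response.body))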