I am new to crawling web pages with Scrapy and unfortunately chose a dynamic one to start with...
I've successfully crawled part of the links (120), thanks to someone helping me here, but not all the links on the target website.
After doing some research, I know that crawling an AJAX page boils down to a few simple steps:
• open the browser developer tools, Network tab
• go to the target site
• click the submit button and see what XHR request goes to the server
• simulate that XHR request in your spider
The last step still sounds obscure to me, though: how exactly do you simulate an XHR request?
I've seen people use 'headers', 'formdata', and other parameters to do it, but I can't figure out what they mean.
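For concreteness, here is a minimal sketch of what simulating an XHR request looks like in Scrapy. Everything in it is a placeholder (the URL, the header, and the form fields are hypothetical, not the real values Google Play expects); the point is only the pattern: copy the URL, form fields, and any required headers from the request you saw in the Network tab into a FormRequest.

import scrapy
from scrapy.http import FormRequest

class XhrExampleSpider(scrapy.Spider):
    # Hypothetical spider that only demonstrates the FormRequest pattern.
    name = "xhr_example"

    def start_requests(self):
        # Reproduce the POST you observed in the browser's Network tab.
        yield FormRequest(
            url="https://example.com/ajax/endpoint",   # placeholder URL
            method="POST",
            # 'headers' mirrors what the browser sent, if the server checks it.
            headers={"X-Requested-With": "XMLHttpRequest"},
            # 'formdata' mirrors the form fields of the request payload.
            formdata={"start": "0", "num": "60"},
            callback=self.parse,
        )

    def parse(self, response):
        # The response is whatever the server returns for the XHR,
        # typically an HTML fragment or JSON.
        self.log(response.text[:200])

The server answers such a request the same way it answers the browser's XHR, so the response body is the same content you saw in the developer tools.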
Here is part of my code:
import re
import scrapy
from scrapy.http import FormRequest

class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def start_request(self, response):
        for i in range(0, 10):
            yield FormRequest(url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                              method="POST",
                              formdata={'start': str(i + 60), 'num': '60', 'numChildren': '0',
                                        'ipf': '1', 'xhr': '1',
                                        'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                              callback=self.parse)

    def parse(self, response):
        links = response.xpath("//a/@href").extract()
        crawledLinks = []
        LinkPattern = re.compile(r"^/store/apps/details\?id=.")
        for link in links:
            if LinkPattern.match(link) and link not in crawledLinks:
                crawledLinks.append("http://play.google.com" + link + "#release")
        for link in crawledLinks:
            yield scrapy.Request(link, callback=self.parse_every_app)

    def parse_every_app(self, response):
        # ...
The start_request method seems to play no role here: if I delete it, the spider still crawls the same number of links.
I've been working on this problem for a week... I'd highly appreciate it if you could help me out...
Try this:
import scrapy
from scrapy.http import FormRequest
# googleAppItem is the asker's Item class; import it from your project's items module.

class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def parse(self, response):
        for i in range(0, 10):
            yield FormRequest(url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                              method="POST",
                              formdata={'start': str(i * 60), 'num': '60', 'numChildren': '0',
                                        'ipf': '1', 'xhr': '1',
                                        'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                              callback=self.data_parse)

    def data_parse(self, response):
        seen = set()
        links = response.xpath("//a/@href").re(r'/store/apps/details.*')
        for l in links:
            if l not in seen:
                seen.add(l)
                item = googleAppItem()   # build a fresh item per unique link
                item['url'] = l
                yield item
Run the spider with scrapy crawl googleApp -o links.csv
or scrapy crawl googleApp -o links.json
and you'll get all the links in a CSV or JSON file. To increase the number of pages crawled, change the range of the for loop.
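As a side note on the original question: the reason start_request plays no role is almost certainly its name. The Scrapy hook is called start_requests, with an s, and it takes no response argument; a method named start_request is simply never invoked, so the spider falls back to start_urls. Here is a minimal sketch of the correctly named override, reusing the question's form fields (the spider name is hypothetical, and the token is session-bound, so it will likely need to be refreshed from the browser):

import scrapy
from scrapy.http import FormRequest

class GoogleAppStartRequestsSpider(scrapy.Spider):
    name = "googleAppStartRequests"   # hypothetical name to avoid clashing with the original
    allowed_domains = ['play.google.com']

    # Note the exact name and signature: Scrapy calls start_requests(self)
    # itself; it receives no response and replaces start_urls entirely.
    def start_requests(self):
        for i in range(0, 10):
            yield FormRequest(url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                              method="POST",
                              formdata={'start': str(i * 60), 'num': '60', 'numChildren': '0',
                                        'ipf': '1', 'xhr': '1',
                                        # Session token copied from the browser; it likely expires.
                                        'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                              callback=self.parse)

    def parse(self, response):
        self.log("got %d bytes" % len(response.body))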