How to simulate an XHR request using Scrapy when trying to crawl data from an AJAX-based website?

Question

I am new to crawling web pages with Scrapy and unfortunately chose a dynamic one to start with...

I've successfully crawled part of the site (120 links), thanks to someone helping me here, but not all the links on the target website.

After doing some research, I know that crawling an AJAX page comes down to a few simple steps:

• open the browser developer tools, Network tab

• go to the target site

• click the submit button and see what XHR request goes to the server

• simulate this XHR request in your spider

The last step still sounds obscure to me, though: how exactly do I simulate an XHR request?

I've seen people using 'headers' or 'formdata' and other parameters to simulate it, but I can't figure out what that means.
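For reference, the snippets I've seen look roughly like this (the URL, form fields and headers below are placeholders I made up, not a real site):

from scrapy import Spider, FormRequest

class ExampleAjaxSpider(Spider):
    name = "example_ajax"
    start_urls = ["https://example.com/list"]

    def parse(self, response):
        # 'formdata' becomes the POST body of the simulated XHR call;
        # 'headers' are extra request headers the browser sent with it
        # (both copied from the developer tools' Network tab).
        yield FormRequest(
            url="https://example.com/ajax/endpoint",
            formdata={"page": "2", "xhr": "1"},
            headers={"X-Requested-With": "XMLHttpRequest"},
            callback=self.parse_ajax,
        )

    def parse_ajax(self, response):
        # The XHR response is parsed like any other Scrapy response.
        self.logger.info("got %d bytes", len(response.body))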

Here is part of my code:

import re

import scrapy
from scrapy import FormRequest

class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def start_request(self, response):
        for i in range(0, 10):
            yield FormRequest(
                url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                method="POST",
                formdata={
                    'start': str(i+60), 'num': '60', 'numChildren': '0',
                    'ipf': '1', 'xhr': '1',
                    'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011',
                },
                callback=self.parse,
            )

    def parse(self, response):
        links = response.xpath("//a/@href").extract()
        crawledLinks = []
        LinkPattern = re.compile(r"^/store/apps/details\?id=.")
        for link in links:
            if LinkPattern.match(link) and link not in crawledLinks:
                crawledLinks.append("http://play.google.com" + link + "#release")
        for link in crawledLinks:
            yield scrapy.Request(link, callback=self.parse_every_app)

    def parse_every_app(self, response):

The start_request method doesn't seem to play any role here. If I delete it, the spider still crawls the same number of links.

I've worked on this problem for a week... I'd highly appreciate it if you could help me out.

Answer

Try this:

from scrapy import Spider, FormRequest

# googleAppItem is the item class defined in your project's items.py
# (a minimal sketch of it follows the spider); adjust the import path.
from yourproject.items import googleAppItem

class googleAppSpider(Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def parse(self, response):
        # Each POST fetches one page of 60 results, mirroring the XHR call
        # the browser sends when the category page is scrolled.
        for i in range(0, 10):
            yield FormRequest(
                url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                method="POST",
                formdata={
                    'start': str(i*60), 'num': '60', 'numChildren': '0',
                    'ipf': '1', 'xhr': '1',
                    'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011',
                },
                callback=self.data_parse,
            )

    def data_parse(self, response):
        # Yield each unique app detail link found in the XHR response.
        seen = {}
        links = response.xpath("//a/@href").re(r'/store/apps/details.*')
        for l in links:
            if l not in seen:
                seen[l] = True
                item = googleAppItem()
                item['url'] = l
                yield item
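For completeness, here is the minimal items.py the spider above assumes; the only field it actually fills is url, so a sketch like this is enough (add more fields if you want to scrape extra data per app):

import scrapy

class googleAppItem(scrapy.Item):
    # 'url' is the only field populated by data_parse above.
    url = scrapy.Field()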

Run the spider with scrapy crawl googleApp -o links.csv or scrapy crawl googleApp -o links.json and you'll get all the links in a CSV or JSON file. To crawl more pages, increase the range of the for loop.
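As for why your original start_request never ran: Scrapy only calls a spider method named start_requests (plural), and it takes no response argument. A method named start_request is silently ignored, so the spider just issues plain GET requests for start_urls. If you prefer to keep that structure instead of looping inside parse, a corrected sketch (same URL and form data as above) would be:

from scrapy import Spider, FormRequest

class googleAppSpider(Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']

    def start_requests(self):
        # Scrapy calls this once to generate the initial requests;
        # start_urls is not needed when it is overridden.
        for i in range(0, 10):
            yield FormRequest(
                url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                method="POST",
                formdata={
                    'start': str(i*60), 'num': '60', 'numChildren': '0',
                    'ipf': '1', 'xhr': '1',
                    'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011',
                },
                callback=self.parse,
            )

    def parse(self, response):
        # Handle each XHR response here, e.g. extract the app links.
        pass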

cc by-sa 3.0