Handling Pagination in Scrapy: Avoiding Infinite Loops and Duplicate Requests
Scrapy is an excellent tool for web scraping, but handling pagination can be challenging. A user recently ran into trouble while scraping Exploit-DB page by page: the spider bounced between the first two pages, had its follow-up requests filtered as duplicates, and never advanced through the rest of the listing.
Let’s dissect the problem and explore solutions to handle pagination effectively, leveraging Scrapy’s features.
The Problem
Here’s the original Scrapy spider:
import scrapy


class TestSpider(scrapy.Spider):
    name = "PLC"
    allowed_domains = ["exploit-db.com"]
    start_urls = [
        "https://www.exploit-db.com/local/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        links = response.xpath('//tr/td[5]/a/@href').extract()
        description = response.xpath('//tr/td[5]/a[@href]/text()').extract()
        for data, link in zip(description, links):
            if "PLC" in data:
                with open(filename, "a") as f:
                    f.write(data + '\n')
                    f.write(link + '\n\n')

        next_page = response.xpath('//div[@class="pagination"][1]//a/@href').extract()
        if next_page:
            url = response.urljoin(next_page[0])
            yield scrapy.Request(url, callback=self.parse)
When running this spider, the user encountered the following issues:
- Scrapy filtered out duplicate requests.
- The pagination XPath potentially selected the same page repeatedly, leading to infinite loops.
Here’s the console output from the run:
2016-06-08 16:05:21 [scrapy] INFO: Enabled item pipelines: []
2016-06-08 16:05:21 [scrapy] INFO: Spider opened
2016-06-08 16:05:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-08 16:05:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-08 16:05:22 [scrapy] DEBUG: Crawled (200) <GET https://www.exploit-db.com/robots.txt> (referer: None)
2016-06-08 16:05:22 [scrapy] DEBUG: Crawled (200) <GET https://www.exploit-db.com/local/> (referer: None)
2016-06-08 16:05:23 [scrapy] DEBUG: Crawled (200) <GET https://www.exploit-db.com/local/?order_by=date&order=desc&pg=2> (referer: https://www.exploit-db.com/local/)
2016-06-08 16:05:23 [scrapy] DEBUG: Crawled (200) <GET https://www.exploit-db.com/local/?order_by=date&order=desc&pg=1> (referer: https://www.exploit-db.com/local/?order_by=date&order=desc&pg=2)
2016-06-08 16:05:23 [scrapy] DEBUG: Filtered duplicate request: <GET https://www.exploit-db.com/local/?order_by=date&order=desc&pg=2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-06-08 16:05:23 [scrapy] INFO: Closing spider (finished)
2016-06-08 16:05:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1162,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 40695,
'downloader/response_count': 4,
'downloader/response_status_count/200': 4,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 8, 8, 5, 23, 514161),
'log_count/DEBUG': 6,
'log_count/INFO': 7,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2016, 6, 8, 8, 5, 21, 561678)}
2016-06-08 16:05:23 [scrapy] INFO: Spider closed (finished)
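The log above already shows the shape of the problem: the spider reaches page 2, follows a link back to page 1, and the follow-up request for page 2 is dropped by the duplicate filter. If you want to see every filtered request rather than just the first, you can turn on DUPEFILTER_DEBUG in settings.py (a standard Scrapy setting, shown here purely as an optional diagnostic):

# settings.py -- log every request dropped by the duplicate filter,
# not just the first one (the crawl behaviour itself is unchanged).
DUPEFILTER_DEBUG = True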
Solution 1: Using dont_filter=True
To bypass Scrapy’s duplicate request filtering, pass dont_filter=True when building the request:
if next_page:
    url = response.urljoin(next_page[0])
    yield scrapy.Request(url, callback=self.parse, dont_filter=True)
However, this alone doesn’t solve the issue of selecting the same link multiple times.
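Also note that dont_filter=True disables the very mechanism that stopped the crawl above, so a faulty pagination selector can now loop indefinitely. As a safety net you can cap how deep the spider follows links with Scrapy’s DEPTH_LIMIT setting; a minimal sketch (the limit of 100 is an arbitrary example):

class TestSpider(scrapy.Spider):
    name = "PLC"
    # Safety net when using dont_filter=True: stop following links after
    # 100 hops even if the pagination selector ever starts looping.
    custom_settings = {"DEPTH_LIMIT": 100}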
Solution 2: Correcting the Pagination XPath
The real problem lies in the XPath for the next page. The original expression collects every link inside the pagination block, and next_page[0] simply takes the first one; on page 2 that first link points back to page 1, so the spider bounces between the two pages until the duplicate filter (or the filter bypass above) gets in the way. Instead, target the “next” link specifically:
next_page = response.css(".pagination").xpath('.//a[contains(@class, "next")]/@href').get()
This selects only the forward (“next”) link inside the pagination block, and contains(@class, "next") keeps matching even if the anchor carries additional classes (e.g., Bootstrap styling).
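Before wiring the new selector into the spider, it is worth checking it interactively with scrapy shell. A quick sketch of such a session (the exact markup of exploit-db.com may have changed since this was written):

# In a terminal: scrapy shell "https://www.exploit-db.com/local/"
# Then, at the interactive prompt:
next_page = response.css(".pagination").xpath('.//a[contains(@class, "next")]/@href').get()
print(next_page)                    # should print a single href (or None if the markup changed)
print(response.urljoin(next_page))  # the absolute URL the spider would request next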
Enhanced Spider with Pagination Fix
Here’s the revised spider:
import scrapy


class TestSpider(scrapy.Spider):
    name = "PLC"
    allowed_domains = ["exploit-db.com"]
    start_urls = [
        "https://www.exploit-db.com/local/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        links = response.xpath('//tr/td[5]/a/@href').extract()
        description = response.xpath('//tr/td[5]/a[@href]/text()').extract()
        for data, link in zip(description, links):
            if "PLC" in data:
                with open(filename, "a") as f:
                    f.write(data + '\n')
                    f.write(link + '\n\n')

        # Follow only the "next" link inside the pagination block.
        next_page = response.css(".pagination").xpath('.//a[contains(@class, "next")]/@href').get()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)
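Run the revised spider from the project directory with scrapy crawl PLC. As a side note (not part of the original question), a slightly more idiomatic variant of the callback yields items instead of writing to a file inside parse, letting Scrapy’s feed exports handle the output; a hedged sketch of the same logic:

    # Alternative parse callback: yield items and run with
    #     scrapy crawl PLC -o plc_exploits.json
    # so a feed export writes the results instead of manual file handling.
    def parse(self, response):
        for anchor in response.xpath('//tr/td[5]/a[@href]'):
            text = anchor.xpath('text()').get(default='')
            if "PLC" in text:
                yield {
                    "description": text,
                    "link": response.urljoin(anchor.xpath('@href').get()),
                }
        next_page = response.css(".pagination").xpath('.//a[contains(@class, "next")]/@href').get()
        if next_page:
            # With only the forward link followed, the duplicate filter can stay on.
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)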
Solution 3: Using BotProxy for Smooth Scraping
Some websites implement strict anti-bot measures that can block or throttle your requests. BotProxy is a robust solution that provides:
- Rotating Proxies: Avoid IP bans by using fresh proxies for each request.
- Anti-Detect Mode: Spoofs TLS fingerprints, mimicking real browsers.
- Global Proxy Network: Enables scraping from multiple regions.
To integrate BotProxy, route your requests through its proxy endpoint (the credentials and host below are placeholders; substitute the values from your BotProxy account):

# Placeholder credentials/endpoint -- use the proxy URL from your BotProxy account.
proxy = "http://user-key:key-password@<botproxy-endpoint>:8080"

if next_page:
    url = response.urljoin(next_page)
    yield scrapy.Request(url, callback=self.parse, dont_filter=True, meta={"proxy": proxy})
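Setting meta={"proxy": ...} on each request quickly gets repetitive. A common Scrapy pattern is to attach the proxy in a small downloader middleware so every outgoing request is routed automatically; a minimal sketch, using the same placeholder proxy URL as above (the module path and middleware priority are illustrative):

# middlewares.py -- route every outgoing request through the proxy.
class BotProxyMiddleware:
    PROXY = "http://user-key:key-password@<botproxy-endpoint>:8080"  # placeholder

    def process_request(self, request, spider):
        # Respect a proxy set explicitly on the request; otherwise fall back to ours.
        request.meta.setdefault("proxy", self.PROXY)
        return None  # let the downloader continue as usual

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.BotProxyMiddleware": 350,  # "myproject" is illustrative
}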
Conclusion
Handling pagination in Scrapy requires careful consideration of selectors and duplicate filtering. By refining your XPath or CSS selectors and leveraging tools like BotProxy, you can ensure a smooth, efficient scraping process. Happy scraping!