I am using a simple CrawlSpider implementation to crawl websites. By default Scrapy follows 302 redirects to their target locations and effectively ignores the originally requested link. On a particular site I encountered a page which 302-redirects to another page. What I aim to do is log both the original link (which responds with 302) and the target location (specified in the HTTP Location response header) and process them in the parse_item method of the CrawlSpider. Please guide me: how can I achieve this?
I came across solutions suggesting dont_redirect=True or REDIRECT_ENABLED=False, but I do not actually want to ignore the redirects; rather, I want to consider (i.e. not ignore) the redirecting page as well.
E.g.: I visit http://www.example.com/page1, which sends a 302 redirect HTTP response pointing to http://www.example.com/page2. By default, Scrapy ignores page1, follows to page2 and processes it. I want to process both page1 and page2 in parse_item.
EDIT
I am already using handle_httpstatus_list = [500, 404] in the spider's class definition to handle 500 and 404 response codes in parse_item, but the same is not working for 302 if I specify it in handle_httpstatus_list.
Scrapy 1.0.5 (latest official as I write these lines) does not use handle_httpstatus_list in the built-in RedirectMiddleware -- see this issue. From Scrapy 1.1.0 (1.1.0rc1 is available), the issue is fixed.
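As an aside, if you keep redirects enabled, RedirectMiddleware records the chain of redirecting URLs in response.meta['redirect_urls'], so you can at least log the original 302 URL alongside the final one (though the redirecting page's body is not re-downloaded). A minimal sketch; the StubResponse class below is only an illustration stand-in for Scrapy's real Response object:

```python
class StubResponse:
    """Illustration-only stand-in for scrapy.http.Response."""
    def __init__(self, url, meta):
        self.url = url
        self.meta = meta

def redirect_chain(response):
    # RedirectMiddleware sets 'redirect_urls' when it followed one or more redirects
    return response.meta.get("redirect_urls", []) + [response.url]

resp = StubResponse(
    "http://www.example.com/page2",
    {"redirect_urls": ["http://www.example.com/page1"]},
)
print(redirect_chain(resp))
# ['http://www.example.com/page1', 'http://www.example.com/page2']
```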
Even if you disable redirects, you can still mimic their behavior in your callback by checking the Location header and returning a Request to the redirect target.
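Note that the Location header may hold a relative URL; response.urljoin resolves it against the current page's URL, equivalently to the stdlib urljoin. A quick sketch using the question's example URLs:

```python
from urllib.parse import urljoin

# A relative Location header is resolved against the redirecting page's URL
assert urljoin("http://www.example.com/page1", "/page2") == "http://www.example.com/page2"
# An absolute Location header is used as-is
assert urljoin("http://www.example.com/page1", "http://httpbin.org/ip") == "http://httpbin.org/ip"
print("ok")
```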
Example spider:
$ cat redirecttest.py
import scrapy


class RedirectTest(scrapy.Spider):

    name = "redirecttest"
    start_urls = [
        'http://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip'
    ]
    handle_httpstatus_list = [302]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, callback=self.parse_page)

    def parse_page(self, response):
        self.logger.debug("(parse_page) response: status=%d, URL=%s" % (response.status, response.url))
        if response.status in (302,) and 'Location' in response.headers:
            self.logger.debug("(parse_page) Location header: %r" % response.headers['Location'])
            yield scrapy.Request(
                response.urljoin(response.headers['Location']),
                callback=self.parse_page)
Console log:
$ scrapy runspider redirecttest.py -s REDIRECT_ENABLED=0
[scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
[scrapy] INFO: Optional features available: ssl, http11
[scrapy] INFO: Overridden settings: {'REDIRECT_ENABLED': '0'}
[scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
[scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
[scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
[scrapy] INFO: Enabled item pipelines:
[scrapy] INFO: Spider opened
[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/get> (referer: None)
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/get
[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None)
[redirecttest] DEBUG: (parse_page) response: status=302, URL=https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip
[redirecttest] DEBUG: (parse_page) Location header: 'http://httpbin.org/ip'
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip)
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/ip
[scrapy] INFO: Closing spider (finished)
Note that you'll need handle_httpstatus_list with 302 in it; otherwise you'll see this kind of log (coming from HttpErrorMiddleware):
[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None)
[scrapy] DEBUG: Ignoring response <302 https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip>: HTTP status code is not handled or not allowed
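As an alternative to putting handle_httpstatus_list on each spider, the same status code can be allowed project-wide via the setting that HttpErrorMiddleware consults (sketched here assuming a default middleware setup):

```python
# settings.py: let 302 responses through HttpErrorMiddleware for all spiders
HTTPERROR_ALLOWED_CODES = [302]
```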