Scrapy is an application framework for crawling websites and extracting structured data, which can then be used for a wide range of applications such as data mining, information processing, or historical archival.
If you plan to use Scrapy with BotProxy, the easiest way is to use our downloader middleware for Scrapy. To use another proxy, follow the instructions below.
In this example we will use our IP rotating proxy server with Scrapy. Your outgoing IP address will be automatically rotated with subsequent requests.
Create a new file called "middlewares.py", save it in your Scrapy project, and add the following code to it. Replace USERNAME and PASSWORD with your proxy access credentials. You can also use USERNAME+us or USERNAME+us-ny to limit outgoing locations (refer to our Docs section).
import base64


# Start your middleware class
class ProxyMiddleware(object):
    # Override process_request so every outgoing request is routed via the proxy
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://x.botproxy.net:8080"

        # Use the following lines if your proxy requires authentication
        auth_creds = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy. b64encode works on bytes
        # and, unlike the old encodestring, does not append a newline.
        access_token = base64.b64encode(auth_creds.encode('utf-8')).decode('ascii')
        request.headers['Proxy-Authorization'] = 'Basic ' + access_token
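As a minimal sketch of how the Proxy-Authorization value is put together, the helper below builds it with the Python 3 standard library. The function name build_proxy_auth_header and its location parameter are our own illustration, not part of Scrapy or BotProxy; the location suffix simply follows the USERNAME+us convention described above.

```python
import base64


def build_proxy_auth_header(username, password, location=None):
    """Return the value for a Proxy-Authorization header (HTTP Basic scheme)."""
    # Optional location suffix: "us" or "us-ny" gives USERNAME+us / USERNAME+us-ny
    if location:
        username = f"{username}+{location}"
    creds = f"{username}:{password}".encode("utf-8")
    # b64encode returns bytes without a trailing newline
    return "Basic " + base64.b64encode(creds).decode("ascii")


if __name__ == "__main__":
    print(build_proxy_auth_header("USERNAME", "PASSWORD", location="us"))
```

Keeping this in one small function makes it easy to check the header value on its own before wiring it into the middleware.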
Open your project's configuration file (./project_name/settings.py) and add the following code:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
Now all your requests will go through the configured proxy.
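If you would rather not hard-code credentials in middlewares.py, one possible variation is to read them from settings.py through Scrapy's from_crawler hook. This is only a sketch: the class name SettingsProxyMiddleware and the BOTPROXY_URL, BOTPROXY_USERNAME and BOTPROXY_PASSWORD setting names are hypothetical and chosen for illustration.

```python
import base64


class SettingsProxyMiddleware:
    """Downloader middleware that takes proxy details from the project settings."""

    def __init__(self, proxy_url, username=None, password=None):
        self.proxy_url = proxy_url
        self.auth_header = None
        if username and password:
            creds = f"{username}:{password}".encode("utf-8")
            self.auth_header = b"Basic " + base64.b64encode(creds)

    @classmethod
    def from_crawler(cls, crawler):
        # BOTPROXY_* are hypothetical setting names used only in this sketch
        settings = crawler.settings
        return cls(
            proxy_url=settings.get("BOTPROXY_URL", "http://x.botproxy.net:8080"),
            username=settings.get("BOTPROXY_USERNAME"),
            password=settings.get("BOTPROXY_PASSWORD"),
        )

    def process_request(self, request, spider):
        # Route the request through the proxy and attach credentials if configured
        request.meta["proxy"] = self.proxy_url
        if self.auth_header:
            request.headers["Proxy-Authorization"] = self.auth_header
```

With this variant you would list 'project_name.middlewares.SettingsProxyMiddleware' in DOWNLOADER_MIDDLEWARES and define the three settings in settings.py.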
Your spider code should look like this:
import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["whatismyip.com"]
    # The following URL is subject to change; you can get the latest one from:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://automation.whatismyip.com/n09230945.asp"]

    def parse(self, response):
        # Save the response body so you can check which IP address the site saw
        with open('test.html', 'wb') as f:
            f.write(response.body)
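To check that rotation is working, it can help to pull the IP address out of the response rather than opening test.html by hand. The sketch below assumes the page body contains the address as plain dotted IPv4 text, which may not hold for every IP-echo service; extract_ip is a hypothetical helper name.

```python
import ipaddress
import re

# Candidate dotted-quad pattern; validity is checked separately below
_IPV4_CANDIDATE = re.compile(
    r"(?<!\d)(?<!\d\.)(?:\d{1,3}\.){3}\d{1,3}(?!\d)(?!\.\d)"
)


def extract_ip(body):
    """Return the first valid IPv4 address found in a response body, or None."""
    text = body.decode("utf-8", errors="replace")
    for candidate in _IPV4_CANDIDATE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # matches the pattern but is not a real address
        return candidate
    return None
```

Inside parse you could call extract_ip(response.body) and log the result with self.logger.info, then run the spider a few times and compare the addresses.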