Using Scrapy with Proxies (IP Rotating Proxy)

Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archival.

If you plan to use Scrapy with BotProxy, the easiest way is to use our downloader middleware for Scrapy. To use another proxy, follow the instructions below.

In this example we will use our IP rotating proxy server with Scrapy. Your outgoing IP address will be automatically rotated with subsequent requests.

Create a new file called "middlewares.py" in your Scrapy project and add the following code to it. Replace USERNAME and PASSWORD with your proxy access credentials. You can also use USERNAME+us or USERNAME+us-ny to limit outgoing locations (refer to our Docs section).

import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Override process_request, which Scrapy calls for every outgoing request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://x.botproxy.net:8080"

        # Use the following lines if your proxy requires authentication
        auth_creds = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy. b64encode works on bytes
        # and, unlike the old encodestring, does not append a trailing newline.
        access_token = base64.b64encode(auth_creds.encode('utf-8')).decode('ascii')
        request.headers['Proxy-Authorization'] = 'Basic ' + access_token
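
The credential encoding step can also be tried on its own, outside Scrapy. The following is a minimal sketch that assumes Python 3. The function name build_proxy_auth_header and its location parameter are hypothetical, introduced only for illustration, and the suffix handling simply follows the USERNAME+us convention mentioned earlier.

```python
import base64


def build_proxy_auth_header(username, password, location=None):
    """Return a Basic-scheme value for the Proxy-Authorization header.

    Hypothetical helper for illustration; not part of Scrapy or BotProxy.
    """
    # Optionally append a location suffix, e.g. "USERNAME+us" or "USERNAME+us-ny"
    if location:
        username = f"{username}+{location}"
    credentials = f"{username}:{password}"
    # b64encode takes and returns bytes, so encode first and decode afterwards
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic " + token


if __name__ == "__main__":
    print(build_proxy_auth_header("USERNAME", "PASSWORD"))
    print(build_proxy_auth_header("USERNAME", "PASSWORD", location="us-ny"))
```

The returned string can be assigned directly to request.headers['Proxy-Authorization'] inside process_request.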

Open your project's configuration file (./project_name/settings.py) and add the following code:

DOWNLOADER_MIDDLEWARES = {
    # Older Scrapy releases used the path scrapy.contrib.downloadermiddleware.httpproxy
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
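
The numbers in DOWNLOADER_MIDDLEWARES are priorities. Scrapy calls each middleware's process_request in ascending order, so the custom ProxyMiddleware (100) runs before HttpProxyMiddleware (110). The sketch below only sorts the dictionary to illustrate that order. It ignores the built-in middlewares that Scrapy merges in from its defaults, and it assumes the module path for HttpProxyMiddleware used by current Scrapy releases.

```python
# Illustration only: sort the configured middlewares by priority to see
# the order in which their process_request methods would be called.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

# Sorting the keys by their priority value gives the call order
call_order = sorted(DOWNLOADER_MIDDLEWARES, key=DOWNLOADER_MIDDLEWARES.get)

for path in call_order:
    print(DOWNLOADER_MIDDLEWARES[path], path)
```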

Now all your requests will go through the configured proxy.

Your spider code should look like this:

import scrapy

class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["whatismyip.com"]
    # The following URL is subject to change; you can get the latest one from here:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://automation.whatismyip.com/n09230945.asp"]

    def parse(self, response):
        # Save the response body (the IP address reported by the site) to a file
        with open('test.html', 'wb') as f:
            f.write(response.body)