Probably no webmaster in the world likes having their website scraped. Of course, it is a little flattering that someone found your content worth scraping, but the downsides of such activity outweigh the compliment.
Website scraping is an additional expense for the site's owner, and the bigger the site and the more aggressive the scraping, the higher the costs. Each scraper-bot request adds load on the CPU and the database that could otherwise serve normal visitors. That is why nobody likes being scraped.
The heavier the load the scraper puts on the site, the faster it will be blocked.
How do you scrape websites without being blocked? The universal answer: the bot's activity should be indistinguishable from the activity of an ordinary visitor. Naturally, the implementation differs for each particular site. Traffic, browsing depth, and session time vary greatly from site to site, so to write a good scraping bot you should first study the target site's audience well.
If the amount of information you want to scrape is large and the site's normal traffic is low, it is impossible to build a bot that quickly gets all the information and never gets blocked. In that case, your only hope is that no one is monitoring the site and you can finish the job before being blocked.
Now let's put ourselves in the place of the webmaster whose site is being scraped. How do you make your site stop being scraped? Again, the universal answer is to make scraping expensive. People usually scrape because it is the cheapest way to get the information they need. If buying a database is cheaper than developing a scraper bot and pulling data from someone else's website, you will obviously choose the first option.
However, do not forget about ordinary visitors: scraping protection should not cause them any particular inconvenience. With all of the above in mind, let's look at specific ways to combat bots and how to get around them.
Blocking by IP address is the easiest method. If the webmaster sees that scraping comes from one or a few IP addresses, he can block those addresses in a few clicks. Therefore, never run the bot from your own computer; even during development, use a proxy server.
Scrape through proxy servers. The more different IP addresses you have, the harder it is to block them all, and the easier it is to write a scraper that imitates a normal user's visit as closely as possible.
IP addresses, and accordingly proxy servers, come in two types: residential and datacenter. Residential addresses belong to Internet providers and are issued to end users for Internet access. You can easily find proxy providers offering both residential IP addresses and datacenter addresses.
Residential proxies are much more expensive. First, it is almost impossible to obtain residential IP addresses legally, since they are issued only to end users. Second, companies that sell such proxies often use gray schemes or hacked users' computers, so keep this in mind.
Datacenter proxies are cheaper because they are easier to create and maintain, but they are also easier to defend against: a sufficiently advanced webmaster can take the type of the visitor's IP address into account and, for example, show a CAPTCHA before the real content whenever a request comes from a datacenter IP.
To avoid blocking because of rate limits (more on this below), it is advisable to use a pool of proxy servers, with each new request or session originating from a different proxy with a new IP address. The BotProxy.net service provides datacenter proxy servers with automatic IP rotation: requests sent through the service are automatically grouped into sessions and routed through different proxies, and all you need to do is specify the single proxy address x.botproxy.net:8080 in your proxy settings. All requests within the same session exit from the same external address, the external IP addresses of the proxy servers change every day, and proxies are available in more than 15 locations around the world.
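As a minimal sketch, here is how such a rotating endpoint could be plugged into a Python requests script. The credentials shown are placeholders, and the exact authentication scheme depends on your BotProxy account settings.

```python
# Route all traffic through the rotating proxy endpoint mentioned above.
# The username/password are placeholders -- substitute your own credentials.
import requests

PROXY = "http://user-key:password@x.botproxy.net:8080"  # placeholder credentials

session = requests.Session()
session.proxies = {"http": PROXY, "https": PROXY}

resp = session.get("https://example.com/page")
print(resp.status_code, resp.headers.get("Content-Type"))
```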
If you scrape a site that targets a local audience (a city, region, or country), use proxy addresses from that region. A large number of requests from other locations can trigger IP blocking.
Blocking when a certain number of requests per unit of time is reached. The specific limits differ and depend heavily on the site; on average, when a limit is set, it is around 1-2 requests per second. Accordingly, you need to control the number of requests sent from each IP address and stay under the limits. The exact value at which a limit kicks in can only be determined by experiment.
Another effective blocking method is counting the number of requests from one IP address per day. If it exceeds a certain value, for example 500-1000 requests, this can be grounds for blocking the IP, especially if it is a datacenter address: ordinary visitors rarely browse through proxy servers.
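A rough sketch of client-side throttling under these assumptions (about one request per second and a daily per-IP budget; both numbers are illustrative, not universal values) might look like this:

```python
# Simple client-side throttle: stay under an assumed per-second rate and a
# daily cap per IP. Adjust the limits to what you observe on the target site.
import time

MIN_DELAY = 1.0        # seconds between requests (~1 request/second)
DAILY_LIMIT = 500      # assumed per-IP daily budget

requests_today = 0
last_request_at = 0.0

def wait_for_slot():
    """Block until it is safe to send the next request."""
    global requests_today, last_request_at
    if requests_today >= DAILY_LIMIT:
        raise RuntimeError("Daily request budget exhausted -- rotate IP or stop")
    elapsed = time.time() - last_request_at
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)
    last_request_at = time.time()
    requests_today += 1
```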
Each user request is accompanied by a large amount of related information transmitted automatically by the browser in the request headers. Among the most important headers are:
User-Agent
Accept
Accept-Encoding
Accept-Language
Cookie
Referer
The User-Agent header carries information about your browser and operating system, the Accept headers describe the browser's configuration, Cookie carries the data the server has previously stored on the client, and Referer indicates the source of the traffic (for more information, see the List of HTTP Headers).
HTTP libraries usually do not send these headers and instead identify themselves explicitly. So the first thing to do when developing a scraper is to automate cookie handling and send the Referer header: very often access to information is blocked if these headers are missing.
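A minimal sketch with Python requests, assuming the hypothetical URLs below: a Session object persists cookies between requests automatically, and the Referer header is set to the previously visited page.

```python
# Sketch: a requests.Session keeps cookies across requests; we also send
# browser-like headers and pass the previous page as Referer.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",  # truncated example
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
})

previous_url = "https://example.com/"
next_url = "https://example.com/catalog"

# Cookies set by the first response are reused automatically on the second request.
session.get(previous_url)
resp = session.get(next_url, headers={"Referer": previous_url})
print(resp.status_code)
```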
User-Agent
Advanced webmasters can analyze the User-Agent (UA) header string to decide whether to block your requests. In the simplest case, they count the requests with identical UAs and block them if that traffic looks like scraping, which is usually easy to spot because the pages visited by a regular user and by a bot differ greatly. Therefore, the second thing your scraper needs is periodic rotation of the UA string, ideally with every new session.
You can read about the structure of the User-Agent string on MDN, for example. If you try to construct such a string randomly, it is very easy to end up with a UA that does not exist in nature, for example the Safari browser on Ubuntu. Advanced bot-blocking systems know every existing browser version as well as statistics on their popularity, so an incorrectly constructed UA string can lead to a CAPTCHA challenge after a small number of requests from your IP address, or simply to blocked access to the information.
Therefore, find up-to-date UA statistics and use that data to generate your headers. Keep in mind that some sites return a different response depending on the User-Agent, especially with regard to mobile versus desktop versions and the supported language.
Besides the contents of the User-Agent string itself, remember that particular browser versions send particular values in the other headers. Large companies and CDN networks can monitor and cross-check this data and, if anomalies are detected, block the traffic or take other measures.
Collect statistics for the headers you send and use them together with the corresponding UAs. Hint: if you run a popular website, log the headers sent by your users and reuse them when scraping other websites.
The browser also sends information about the user's installed languages. Send these headers according to the IP address you are using and the language of the target website. For example, if you scrape a French website from a US IP address, include both English and French in the Accept-Language header.
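One way to keep all of these values consistent is to rotate complete header profiles rather than individual headers, so that Accept and Accept-Language always match the chosen User-Agent and proxy region. The profiles below are illustrative sketches, not real traffic statistics:

```python
# Sketch: pick one coherent header "profile" per session instead of mixing
# randomly generated values. Build real profiles from logged browser traffic.
import random

PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome-like ...",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9,fr;q=0.8",   # US proxy, French target site
        "Accept-Encoding": "gzip, deflate",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) Safari-like ...",
        "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
        "Accept-Language": "fr-FR,fr;q=0.9,en;q=0.7",   # French proxy, French target site
        "Accept-Encoding": "gzip, deflate",
    },
]

def headers_for_new_session():
    # One profile per session keeps UA, Accept, and language values consistent.
    return dict(random.choice(PROFILES))
```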
It is not even interesting to write about CAPTCHA: everyone knows what it is, and users dislike it even more than bot creators do. Even though a CAPTCHA is now hard to solve even for a human, a Stanford University study found that 13 out of 15 CAPTCHA systems used by popular websites are susceptible to automated attacks, thanks to machine-learning techniques that can recognize images and audio recordings. And for the remaining two systems, even without such complex technology, there are many services that exploit cheap labor from third-world countries to solve CAPTCHA challenges manually.
Typically, a CAPTCHA is requested once per session. The cost of solving it through third-party services is around $0.0001 or less, so if you need to solve CAPTCHAs in your scraper it is often easier to integrate one of these services (for example, Antigate) than to develop a custom solution.
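A hedged sketch of such an integration is shown below. The endpoint, parameters, and response fields are hypothetical placeholders; every real service (Antigate included) has its own documented API that you should follow instead.

```python
# Hypothetical sketch of delegating a CAPTCHA image to a solving service:
# submit the image, then poll until a result comes back.
import time
import requests

API_KEY = "your-api-key"                       # placeholder
SOLVER = "https://captcha-solver.example.com"  # hypothetical endpoint

def solve_captcha(image_bytes: bytes) -> str:
    task = requests.post(f"{SOLVER}/tasks",
                         files={"image": image_bytes},
                         data={"key": API_KEY}).json()
    while True:
        result = requests.get(f"{SOLVER}/tasks/{task['id']}",
                              params={"key": API_KEY}).json()
        if result.get("status") == "ready":
            return result["text"]
        time.sleep(5)  # solving usually takes a few seconds to a minute
```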
In rare cases, for example when scraping Google search results, it is very difficult to get around the CAPTCHA, and it is better to concentrate on other workarounds, such as using a large number of proxy servers.
Perhaps we will later post a separate article with a detailed description of how to integrate an anti-CAPTCHA service. Check back on our website.
The most advanced anti-scraping methods analyze user behavior. Various statistical parameters are calculated, and if a user's behavior deviates significantly from the median values, the user is marked as suspicious and can be blocked or redirected to a CAPTCHA page.
Other bot-detection technology checks that the browser has a proper JavaScript engine, that requests are formatted correctly, and that all of the browser's components behave as they should.
Analyzing which files are requested also works: 99% of bots never request CSS, JS files, or other resources.
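To blend in, a scraper can also download the assets referenced by a page. A simplistic sketch using requests and BeautifulSoup (both assumed to be installed):

```python
# Sketch: after downloading a page, also request its CSS/JS/image assets so
# the traffic pattern looks less like "HTML only" bot behaviour.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def fetch_like_a_browser(session: requests.Session, url: str) -> str:
    resp = session.get(url)
    soup = BeautifulSoup(resp.text, "html.parser")

    assets = [tag.get("href") for tag in soup.find_all("link", rel="stylesheet") if tag.get("href")]
    assets += [tag.get("src") for tag in soup.find_all(["script", "img"]) if tag.get("src")]

    for asset in assets:
        # Failures on individual assets are not critical for the scrape itself.
        try:
            session.get(urljoin(url, asset), timeout=10)
        except requests.RequestException:
            pass
    return resp.text
```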
Traps (also called honeypots) are a very simple and effective way to combat bots. A trap is a page that a bot will visit but a regular person never will. For example, we can place a honeypot link on a page and hide it using CSS or JavaScript. A bot that fetches the HTML but does not analyze element visibility and does not execute the JavaScript that manipulates the DOM will follow such a link; its IP address is thereby exposed and can be blocked.
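From the scraper's side, a partial defense is to skip links that are hidden with inline styles. This is only a sketch: it will not catch honeypots hidden via external stylesheets or JavaScript.

```python
# Simplistic sketch: yield only links that are not hidden by inline CSS.
from bs4 import BeautifulSoup

HIDDEN_MARKERS = ("display:none", "display: none", "visibility:hidden", "visibility: hidden")

def visible_links(html: str):
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").lower()
        if any(marker in style for marker in HIDDEN_MARKERS):
            continue  # likely a honeypot link -- do not follow it
        yield a["href"]
```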
This method is closely related to the previous one: the page contains JavaScript code that is executed by a browser but not by a bot.
To circumvent such protection measures you can use browser automation, that is, retrieve the data with real browsers through their APIs. This is much more expensive than lightweight script-based scrapers, but it is often the only way to obtain the required data.
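For example, a minimal Selenium sketch (assuming Selenium and a matching ChromeDriver are installed) that loads a page in headless Chrome and reads the DOM after JavaScript has run:

```python
# Sketch: drive a real browser so the page's JavaScript actually executes.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # newer Chrome headless mode; adjust for your version
# Optional: route the browser through a proxy; authenticated proxies need extra setup.
# options.add_argument("--proxy-server=http://x.botproxy.net:8080")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/page")
    html = driver.page_source  # DOM after JavaScript has executed
    print(len(html))
finally:
    driver.quit()
```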
We do not write much about advanced bot-mitigation methods, because fighting them requires deep investigation of each particular case. And once you are ready for that, you definitely no longer need our advice.
It is interesting to see which bot countermeasures the OWASP Automated Threat Handbook suggests:
Note that in certain applications, some types of Scraping may be desirable, or even encouraged, rather than being threats.
The fight against bots and the search for new ways to bypass blocks are inexhaustible. For every action by one side, there is a counteraction by the other, so no matter which side you are on at the moment, you will not be bored. Every web programmer eventually ends up on one side of this fight or the other.
Connect your software to ultra-fast rotating proxies with fresh IPs every day and worldwide locations in minutes. We allow full-speed multithreaded connections and charge only for the bandwidth used. A typical integration takes less than 5 minutes with any script or application.