Scrapy version: 1.0.5
I have searched for a long time, but most of the workarounds don't work with the current Scrapy version.
My spider is defined in jingdong_spider.py, and the interface (learned from the Scrapy documentation) to run the spider is below:
# interface
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

def search(keyword):
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(JingdongSpider, keyword)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
Then in temp.py I call the search(keyword) defined above to run the spider.
Now the problem: I called search(keyword) once and it worked well. But when I called it twice, for instance,
in temp.py:

    search('iphone')
    search('ipad2')
it reported:
Traceback (most recent call last):
  File "C:/Users/jiahao/Desktop/code/bbt_climb_plus/temp.py", line 7, in <module>
    search('ipad2')
  File "C:\Users\jiahao\Desktop\code\bbt_climb_plus\bbt_climb_plus\spiders\jingdong_spider.py", line 194, in search
    reactor.run()  # the script will block here until the crawling is finished
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
The first search(keyword) succeeded, but the second one failed.
Could you help?
In your code sample you are calling twisted.reactor and starting it on every function call. This does not work because there is only one reactor per process, and you cannot start it twice.
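The restriction comes from Twisted itself, not from Scrapy. Here is a minimal sketch, using nothing but Twisted, that reproduces the error from your traceback:

    from twisted.internet import reactor, error

    # schedule an immediate stop so the first run() returns
    reactor.callWhenRunning(reactor.stop)
    reactor.run()  # first start: fine

    try:
        reactor.run()  # second start: the reactor cannot be restarted
    except error.ReactorNotRestartable:
        print("ReactorNotRestartable raised, same as in the traceback above")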
There are two ways to solve your problem, both described in the Scrapy documentation on running Scrapy from a script. Either stick with CrawlerRunner but move reactor.run() outside your search() function to ensure it is only called once (see the sketch after the CrawlerProcess example below), or use CrawlerProcess and simply call crawler_process.start(). The second approach is easier; your code would look like this:
from scrapy.crawler import CrawlerProcess
from dirbot.spiders.dmoz import DmozSpider

def search(runner, keyword):
    # only schedules the crawl; nothing runs until runner.start() is called
    return runner.crawl(DmozSpider, keyword)

runner = CrawlerProcess()
search(runner, "alfa")
search(runner, "beta")
runner.start()  # starts the reactor once and blocks until both crawls finish
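For completeness, here is a minimal sketch of the first approach as well, keeping CrawlerRunner but starting the reactor only once. It assumes your JingdongSpider is importable (the import path below is a guess based on your traceback) and that your Scrapy version provides CrawlerRunner.join():

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    # hypothetical import path, adjust to your project layout
    from bbt_climb_plus.spiders.jingdong_spider import JingdongSpider

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()

    def search(keyword):
        # only schedules the crawl; the reactor is started once, below
        return runner.crawl(JingdongSpider, keyword)

    search('iphone')
    search('ipad2')
    d = runner.join()                    # deferred that fires when all crawls finish
    d.addBoth(lambda _: reactor.stop())  # stop the reactor exactly once
    reactor.run()                        # blocks until both crawls are done

In both variants the crawl() calls only schedule work; everything runs inside a single reactor start, which is why the reactor never has to be restarted.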