I'm trying to pull data from this page using Scrapy: https://www.interpol.int/notice/search/woa/1192802
The spider will crawl multiple pages, but I have excluded the pagination code here to keep things simple. The problem is that the number of table rows I want to scrape varies from page to page.
So I need a way of scraping all the table data from the page no matter how many table rows it has.
First, I extracted all the table rows on the page. Then, I created a blank dictionary. Next, I tried to loop through each row and put its cell data into the dictionary.
But it does not work; it returns a blank file.
Any idea what's wrong?
# -*- coding: utf-8 -*-
import scrapy


class Test1Spider(scrapy.Spider):
    name = 'test1'
    allowed_domains = ['interpol.int']
    start_urls = ['https://www.interpol.int/notice/search/woa/1192802']

    def parse(self, response):
        table_rows = response.xpath('//*[contains(@class,"col_gauche2_result_datasheet")]//tr').extract()
        data = {}
        for table_row in table_rows:
            data.update({response.xpath('//td[contains(@class, "col1")]/text()').extract(): response.css('//td[contains(@class, "col2")]/text()').extract()})
        yield data
What is this?

response.css('//td[contains(@class, "col2")]/text()').extract()

You are calling the css() method but passing it an XPath expression. css() expects a CSS selector (the CSS equivalent here would be response.css('td.col2::text')), while xpath() takes an XPath expression. There is a second problem too: extract() returns a list, and a list is unhashable, so using it as a dictionary key in data.update() raises a TypeError. You also need to select relative to each row rather than the whole response, otherwise every row pulls the same cells.

Anyway, here is the working code; I have tested it:
table_rows = response.xpath('//*[contains(@class,"col_gauche2_result_datasheet")]//tr')
data = {}
for table_row in table_rows:
    data[table_row.xpath('td[@class="col1"]/text()').extract_first().strip()] = table_row.xpath('td[@class="col2 strong"]/text()').extract_first().strip()
yield data
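One caveat: extract_first() returns None when a row has no matching cell (header or separator rows, for example), and calling .strip() on None raises an AttributeError. A minimal sketch of a guard, using a hypothetical clean() helper (not part of Scrapy):

```python
def clean(value):
    """Strip whitespace from an extracted cell, tolerating missing cells.

    extract_first() returns None when the XPath matches nothing, and
    None.strip() would raise AttributeError, so fall back to ''.
    """
    return value.strip() if value is not None else ''

# Hypothetical cell values as extract_first() might return them:
print(clean('  Date of birth  '))  # -> 'Date of birth'
print(clean(None))                 # -> ''
```

With this, the assignment becomes data[clean(...)] = clean(...), and rows missing either cell can simply be skipped.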
EDIT:
To remove characters like \t, \n and \r, use a regular expression:

import re
your_string = re.sub(r'\t|\n|\r', '', your_string)
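For instance, cleaning a hypothetical raw cell value (the exact input string here is made up for illustration):

```python
import re

# Hypothetical raw cell text containing stray control characters
raw = '\tDate of birth\r\n'
cleaned = re.sub(r'\t|\n|\r', '', raw)
print(cleaned)  # -> 'Date of birth'
```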