Scrapy pull data from table rows


I'm trying to pull data from this page using Scrapy:

The spider will crawl multiple pages, but I have left out the pagination code here to keep things simple. The problem is that the number of table rows I want to scrape varies from page to page.

So I need a way of scraping all the table data from the page no matter how many table rows it has.

First, I extracted all the table rows on the page. Then, I created an empty dictionary. Next, I tried to loop through each row and put its cell data into the dictionary.

But it does not work: the output file is empty.

Any idea what's wrong?

# -*- coding: utf-8 -*-
import scrapy

class Test1Spider(scrapy.Spider):
    name = 'test1'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        table_rows = response.xpath('//*[contains(@class,"col_gauche2_result_datasheet")]//tr').extract()
        data = {}
        for table_row in table_rows:
            data.update({response.xpath('//td[contains(@class, "col1")]/text()').extract(): response.css('//td[contains(@class, "col2")]/text()').extract()})
        yield data


What is this?

response.css('//td[contains(@class, "col2")]/text()').extract()

You are calling the css() method but passing it an XPath expression.

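Anyway, here is working code; I have tested it. It keeps the row selectors instead of calling extract() on them, and it queries each row with a relative XPath so that only that row's cells are read.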

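# Keep the selectors (no .extract()) so each row can be queried on its own.
table_rows = response.xpath('//*[contains(@class,"col_gauche2_result_datasheet")]//tr')
data = {}
for table_row in table_rows:
    # Relative XPath: look only inside the current row; default='' avoids calling strip() on None.
    key = table_row.xpath('td[@class="col1"]/text()').extract_first(default='').strip()
    value = table_row.xpath('td[@class="col2 strong"]/text()').extract_first(default='').strip()
    if key:
        data[key] = value
yield data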


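To remove characters such as \t, \n and \r, use a regular expression.

import re
# A raw string with a character class matches any single tab, newline or carriage return.
your_string = re.sub(r'[\t\n\r]', '', your_string)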
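As a small illustration of that cleanup, the sketch below wraps the substitution in a helper and compares it with a plain split/join. The function names and the sample string are made up for this example.

```python
import re

def clean_cell(text):
    """Remove tabs, newlines and carriage returns, then trim the ends."""
    return re.sub(r'[\t\n\r]', '', text).strip()

def collapse_whitespace(text):
    """Alternative: squeeze every run of whitespace into a single space."""
    return ' '.join(text.split())

raw = '\r\n\t\tStainless \n steel\t\r\n'  # hypothetical cell text

print(repr(clean_cell(raw)))
print(repr(collapse_whitespace(raw)))
```

Note that the regex only deletes the control characters, so spaces that surrounded them stay in place; the split/join variant may be the better fit if you also want inner runs of spaces collapsed.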

cc by-sa 3.0