Scrapy is a robust web scraping framework for Python, but when working with its output, especially in CSV format, you might encounter unexpected newline characters (\n) embedded in scraped text. These can lead to misaligned fields and apparently missing data when scraping sites like TripAdvisor. Let’s explore how to address this issue.

The Problem

A user scraping TripAdvisor observed that each review contained embedded newline characters, which broke records across lines in the CSV output and left some fields missing. Below is the problematic Scrapy spider:

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapingtest.items import ScrapingTestingItem
from collections import OrderedDict
import html2text
import unicodedata

class ScrapingTestSpider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    start_urls = [
        "http://www.tripadvisor.in/Hotel_Review-g297679-d736080-Reviews-Ooty_Elk_Hill_A_Sterling_Holidays_Resort-Ooty_Tamil_Nadu.html"
    ]

    def parse(self, response):
        item = ScrapingTestingItem()
        sel = Selector(response)

        item['reviews'] = sel.xpath('//div[@class="col2of2"]//p[@class="partial_entry"]/text()').extract()
        item['subjects'] = sel.xpath('//span[@class="noQuotes"]/text()').extract()
        item['stars'] = sel.xpath('//*[@class="rating reviewItemInline"]//img/@alt').extract()
        item['names'] = sel.xpath('//*[@class="username mo"]/span/text()').extract()
        item['location'] = sel.xpath('//*[@class="location"]/text()').extract()
        item['date'] = sel.xpath('//*[@class="ratingDate relativeDate"]/@title').extract()
        item['date'] += sel.xpath('//div[@class="col2of2"]//span[@class="ratingDate"]/text()').extract()

        # NFKD-normalize and strip() each value. Note that strip() only removes
        # leading/trailing whitespace, so newlines *inside* a review survive.
        for i in range(len(item['reviews'])):
            item['reviews'][i] = unicodedata.normalize('NFKD', item['reviews'][i]).strip()

        for j in range(len(item['subjects'])):
            item['subjects'][j] = unicodedata.normalize('NFKD', item['subjects'][j]).strip()

        yield item

        next_pages = sel.xpath('//a[contains(text(), "Next")]/@href').extract()
        if next_pages:
            for next_page in next_pages:
                yield Request(url="http://tripadvisor.in" + next_page, callback=self.parse)
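
Before looking at the fix, it helps to see why an embedded newline causes trouble in the first place. Below is a minimal, standalone sketch (the subject and review strings are invented, not actual TripAdvisor data): the buffer holds two CSV records, but any consumer that naively splits the file on newlines sees three "rows".

import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["subject", "review"])
writer.writerow(["Great stay", "Lovely view.\nFriendly staff."])  # review spans two lines

raw = buf.getvalue()
print(len(raw.splitlines()))                      # 3 -> the embedded \n adds a phantom "row"
print(len(list(csv.reader(io.StringIO(raw)))))    # 2 -> a proper CSV parser still sees two records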

Solution: Using Item Loaders and Processors

To clean and standardize data, leveraging Scrapy’s ItemLoader along with input and output processors is an effective approach. Here’s how:

  1. Define the ItemLoader

Create an item loader whose processors trim whitespace and collapse embedded newlines:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

class ScrapingTestingLoader(ItemLoader):
    # Applied to every extracted value: collapse newlines, tabs and runs of
    # spaces into single spaces (str.strip alone would leave inner \n intact).
    default_input_processor = MapCompose(lambda s: ' '.join(s.split()))
    # Applied when the item is built: keep the first non-empty value per field.
    default_output_processor = TakeFirst()
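
To see what these processors do in isolation, here is a quick standalone check (the review string is made up for illustration):

from scrapy.loader.processors import MapCompose, TakeFirst

clean = MapCompose(lambda s: ' '.join(s.split()))
print(clean(['  Lovely view.\nFriendly staff.  ']))   # ['Lovely view. Friendly staff.']
print(TakeFirst()(['', 'first match', 'second']))     # 'first match' (skips empty values)
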
  2. Update the Spider

Modify the spider to use the new loader:

from scrapy.spiders import Spider
from scrapy.http import Request

from scrapingtest.items import ScrapingTestingItem
from scrapingtest.loaders import ScrapingTestingLoader

class ScrapingTestSpider(Spider):
    # name, allowed_domains and start_urls stay the same as before

    def parse(self, response):
        loader = ScrapingTestingLoader(item=ScrapingTestingItem(), response=response)
        loader.add_xpath('reviews', '//div[@class="col2of2"]//p[@class="partial_entry"]/text()')
        loader.add_xpath('subjects', '//span[@class="noQuotes"]/text()')
        loader.add_xpath('stars', '//*[@class="rating reviewItemInline"]//img/@alt')
        loader.add_xpath('names', '//*[@class="username mo"]/span/text()')
        loader.add_xpath('location', '//*[@class="location"]/text()')
        loader.add_xpath('date', '//*[@class="ratingDate relativeDate"]/@title')
        yield loader.load_item()

        # Handle pagination; urljoin builds an absolute URL from the relative href
        next_pages = response.xpath('//a[contains(text(), "Next")]/@href').extract()
        for next_page in next_pages:
            yield Request(url=response.urljoin(next_page), callback=self.parse)
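
Once the spider runs and the feed is exported (for example with scrapy crawl scrapytesting -o reviews.csv; the output file name here is arbitrary), a quick check can confirm that every row now has the same number of columns:

import csv

with open('reviews.csv', newline='') as f:   # newline='' lets csv handle any quoted line breaks
    rows = list(csv.reader(f))

widths = {len(row) for row in rows}
print(len(rows), "rows; column counts seen:", widths)   # a single count means the fields line up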

Explanation

  • Input Processor (MapCompose): Runs the given functions over every extracted value (e.g., collapsing whitespace), so each string is cleaned before it is stored on the item.
  • Output Processor (TakeFirst): Returns the first non-empty value collected for a field, so single-valued fields come out as plain strings rather than one-element lists.

With this setup, every extracted value has its embedded \n characters collapsed before it reaches the item, so fields like reviews appear as clean strings in the output. Keep in mind that TakeFirst keeps only one value per field; for multi-valued fields you can override the output processor, as shown below.
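
If you want every review and subject from a page rather than just the first one, you can set field-specific output processors on the loader from step 1. A minimal sketch (the field names are taken from the item above; whether you keep lists or join them into one string is up to you):

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Identity

class ScrapingTestingLoader(ItemLoader):
    default_input_processor = MapCompose(lambda s: ' '.join(s.split()))
    default_output_processor = TakeFirst()

    # Field-specific overrides: keep these fields as lists instead of a single value
    reviews_out = Identity()
    subjects_out = Identity()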


Conclusion

Scrapy’s ItemLoader simplifies data cleaning and standardization. By using input and output processors, you can easily eliminate unwanted characters like \n, ensuring consistent and accurate data for further analysis or export. Happy scraping!