Scrapy is a robust web scraping framework for Python, but when working with its output, especially in CSV format, you might encounter embedded newline characters (`\n`) in scraped fields. These can lead to misaligned columns and incomplete data when scraping sites like TripAdvisor. Let's explore how to address this issue.
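To see why an embedded newline breaks a CSV row, consider this minimal sketch (the hotel name and review text are invented for illustration):

```python
# A scraped field that still contains an embedded newline:
row = ["Elk Hill Resort", "Great stay!\nWould visit again", "5 of 5 stars"]

# Joining fields naively, without CSV quoting, splits one logical row
# across two physical lines, so a CSV reader sees misaligned columns:
print(",".join(row))

# Removing the newlines up front keeps each row on a single line:
cleaned = [field.replace("\n", " ") for field in row]
print(",".join(cleaned))
```

Python's `csv` module would quote such a field, but the exported cells are far easier to work with once the newlines are removed at scrape time.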
The Problem
A user scraping TripAdvisor observed that each review contained embedded newlines, causing their CSV output to have more columns than expected and leaving some fields missing or misaligned. Below is the problematic Scrapy spider:
```python
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapingtest.items import ScrapingTestingItem
from collections import OrderedDict
import html2text
import unicodedata


class ScrapingTestSpider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    start_urls = [
        "http://www.tripadvisor.in/Hotel_Review-g297679-d736080-Reviews-Ooty_Elk_Hill_A_Sterling_Holidays_Resort-Ooty_Tamil_Nadu.html"
    ]

    def parse(self, response):
        item = ScrapingTestingItem()
        sel = Selector(response)
        item['reviews'] = sel.xpath('//div[@class="col2of2"]//p[@class="partial_entry"]/text()').extract()
        item['subjects'] = sel.xpath('//span[@class="noQuotes"]/text()').extract()
        item['stars'] = sel.xpath('//*[@class="rating reviewItemInline"]//img/@alt').extract()
        item['names'] = sel.xpath('//*[@class="username mo"]/span/text()').extract()
        item['location'] = sel.xpath('//*[@class="location"]/text()').extract()
        item['date'] = sel.xpath('//*[@class="ratingDate relativeDate"]/@title').extract()
        item['date'] += sel.xpath('//div[@class="col2of2"]//span[@class="ratingDate"]/text()').extract()

        for i in range(len(item['reviews'])):
            item['reviews'][i] = unicodedata.normalize('NFKD', item['reviews'][i]).strip()
        for j in range(len(item['subjects'])):
            item['subjects'][j] = unicodedata.normalize('NFKD', item['subjects'][j]).strip()

        yield item

        next_pages = sel.xpath('//a[contains(text(), "Next")]/@href').extract()
        if next_pages:
            for next_page in next_pages:
                yield Request(url="http://tripadvisor.in" + next_page, callback=self.parse)
```
Solution: Using Item Loaders and Processors
To clean and standardize data, leveraging Scrapy's `ItemLoader` along with input and output processors is an effective approach. Here's how:
- Define the ItemLoader
Create an item loader with processors that trim surrounding whitespace, including newlines:
```python
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst
# In Scrapy 2.x the processors also live in the standalone itemloaders package:
# from itemloaders.processors import MapCompose, TakeFirst


class ScrapingTestingLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()
```
- Update the Spider
Modify the spider to use the new loader:
```python
from scrapy.spiders import Spider
from scrapy.http import Request
from scrapingtest.items import ScrapingTestingItem
from scrapingtest.loaders import ScrapingTestingLoader


class ScrapingTestSpider(Spider):
    # name, allowed_domains, and start_urls as before

    def parse(self, response):
        loader = ScrapingTestingLoader(item=ScrapingTestingItem(), response=response)
        loader.add_xpath('reviews', '//div[@class="col2of2"]//p[@class="partial_entry"]/text()')
        loader.add_xpath('subjects', '//span[@class="noQuotes"]/text()')
        loader.add_xpath('stars', '//*[@class="rating reviewItemInline"]//img/@alt')
        loader.add_xpath('names', '//*[@class="username mo"]/span/text()')
        loader.add_xpath('location', '//*[@class="location"]/text()')
        loader.add_xpath('date', '//*[@class="ratingDate relativeDate"]/@title')
        yield loader.load_item()

        # Handle pagination
        next_pages = response.xpath('//a[contains(text(), "Next")]/@href').extract()
        for next_page in next_pages:
            yield Request(url="http://tripadvisor.in" + next_page, callback=self.parse)
```
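Incidentally, building the next-page URL by string concatenation is fragile: it breaks if the extracted href is ever absolute or lacks a leading slash. Scrapy's `response.urljoin(next_page)` handles both cases, as does the standard library's `urljoin` it wraps, sketched here with made-up hrefs:

```python
from urllib.parse import urljoin

current = "http://www.tripadvisor.in/Hotel_Review-page1.html"  # illustrative URL

# A root-relative href resolves against the site root:
print(urljoin(current, "/Hotel_Review-page2.html"))
# -> http://www.tripadvisor.in/Hotel_Review-page2.html

# An already-absolute href is returned unchanged:
print(urljoin(current, "http://www.tripadvisor.in/Hotel_Review-page3.html"))
# -> http://www.tripadvisor.in/Hotel_Review-page3.html
```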
Explanation
- Input processor (`MapCompose(str.strip)`): applies each function to every extracted value, stripping leading and trailing whitespace, including `\n`, from each string. If newlines can appear in the middle of a review, add a replacement step, e.g. `MapCompose(lambda s: s.replace('\n', ' '), str.strip)`.
- Output processor (`TakeFirst()`): returns the first non-null value from the list of extracted items. As a default it collapses multi-valued fields such as `reviews` down to a single value, so give those fields their own output processor (e.g. `reviews_out = Identity()`) if you want to keep the full list.

With this setup, fields like `reviews` will automatically have their surrounding `\n` characters removed and appear as clean strings in the output.
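The processors' behavior can be checked without running a spider. Below are simplified stand-ins for `MapCompose` and `TakeFirst` (the real implementations also flatten iterables returned by the functions), applied to strings shaped like the whitespace-padded review text such pages yield:

```python
def map_compose(*funcs):
    """Simplified stand-in for MapCompose: apply each function to
    every value in the list, dropping None results along the way."""
    def process(values):
        for func in funcs:
            values = [func(v) for v in values if v is not None]
        return values
    return process


def take_first(values):
    """Simplified stand-in for TakeFirst: first non-null, non-empty value."""
    for v in values:
        if v is not None and v != "":
            return v
    return None


clean = map_compose(str.strip)
raw = ["\n        Great stay!\n", "\n        Room was clean.\n"]
print(clean(raw))              # ['Great stay!', 'Room was clean.']
print(take_first(clean(raw)))  # 'Great stay!'
```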
Conclusion
Scrapy's `ItemLoader` simplifies data cleaning and standardization. By using input and output processors, you can easily eliminate unwanted characters like `\n`, ensuring consistent, well-aligned rows when exporting (for example with `scrapy crawl scrapytesting -o reviews.csv`). Happy scraping!