Scrapy: extract text with special characters

Question

I'm using Scrapy to extract text from some Spanish websites. The text is written in Spanish, so some words contain special characters such as 'ñ' or 'í'. My problem is that when I run the following on the command line to get a file with the scraped data, some characters are not shown properly:

scrapy crawl econoticia -o prueba.json

For example, this is the original text:

"La exministra, procesada como partícipe a titulo lucrativo, intenta burlar a los fotógrafos"

And this is the scraped text:

"La exministra, procesada como part\u00edcipe a titulo lucrativo, intenta burlar a los fot\u00f3grafos"

I want the JSON to contain the special characters themselves. I presume my spider code needs something to produce the JSON correctly. This is my spider code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector
from pais.items import PaisItem


class NoticiaSpider(scrapy.Spider):
    name = "noticia"
    allowed_domains = ["elpais.com"]
    start_urls = (...

    )

    def parse(self, response):

        hxs = HtmlXPathSelector(response)
        item = PaisItem()
        item['subtitulo'] = hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[1]/span/text()').extract()
        item['titular'] = hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[3]/div[2]/div[1]/h1/a/text()').extract()
        return item

Answer 1

Maybe you should add .encode('utf8') after extract().

Answer 2

When you write the characters to the file, you need to encode them as UTF-8. Note that extract() returns a list of strings, so the encoding has to be applied to each element rather than to the list itself. Try changing the last lines of your example to the following:

item['subtitulo'] = [s.encode('utf-8') for s in hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[1]/span/text()').extract()]
item['titular'] = [s.encode('utf-8') for s in hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[3]/div[2]/div[1]/h1/a/text()').extract()]
return item
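
It may also help to see where the backslash sequences come from. The sketch below uses only Python's standard json module, not Scrapy itself, and assumes the feed export behaves like a default JSON serializer that escapes non-ASCII characters. The item dictionary is a stand-in for the scraped item, built from the sample text in the question.

```python
import json

# Sample text taken from the question; the dict is a stand-in for the scraped item.
text = "La exministra, procesada como partícipe a titulo lucrativo, intenta burlar a los fotógrafos"
item = {"titular": [text]}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written as they are.
readable = json.dumps(item, ensure_ascii=False)

# Both strings are valid JSON and decode back to the same data.
assert json.loads(escaped) == json.loads(readable) == item

print(escaped)
print(readable)
```

If this assumption holds, the escaped file is not corrupted: any JSON parser turns the escapes back into 'í' and 'ó'. To get the literal characters in the output file, newer Scrapy releases (1.2 and later, as far as I know) provide the FEED_EXPORT_ENCODING setting, which can be set to 'utf-8' in the project's settings.py.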
