
Scrapy: extract text with special characters

I'm using Scrapy to extract text from some Spanish websites. The text is written in Spanish, so some words contain special characters such as 'ñ' or 'í'.

My problem is that when I run

scrapy crawl econoticia -o prueba.json

on the command line to get the file with the scraped data, some characters are not written the way I expect. For example, this is the original text:

"La exministra, procesada como partícipe a titulo lucrativo, intenta burlar a los fotógrafos"

And this is the scraped text:

"La exministra, procesada como part\u00edcipe a titulo lucrativo, intenta burlar a los fot\u00f3grafos"

I would like the JSON to contain the special characters themselves. I presume my spider code needs something extra to produce the JSON correctly. This is my spider code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector
from pais.items import PaisItem


class NoticiaSpider(scrapy.Spider):
    name = "noticia"
    allowed_domains = ["elpais.com"]
    start_urls = (...

    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = PaisItem()
        item['subtitulo'] = hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[1]/span/text()').extract()
        item['titular'] = hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[3]/div[2]/div[1]/h1/a/text()').extract()
        return item

Maybe you should add .encode('utf8') after extract().

When you write the characters to the file, you need to encode them as UTF-8. Try changing the last lines of your example to the following:

# extract() returns a list of strings, so encode each element rather than the list itself
item['subtitulo'] = [s.encode('utf-8') for s in hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[1]/span/text()').extract()]
item['titular'] = [s.encode('utf-8') for s in hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[3]/div[2]/div[1]/h1/a/text()').extract()]
return item
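
For context, the \u00ed and \u00f3 sequences in the scraped file are standard JSON escapes, not corrupted data: any JSON parser turns them back into 'í' and 'ó'. The sketch below uses only Python's standard json module (not Scrapy itself) to show the two ways the same headline can be serialized; the variable names are illustrative and are not part of the original spider.

```python
import json

# Sample headline from the question, containing accented characters.
original = (
    "La exministra, procesada como partícipe a titulo lucrativo, "
    "intenta burlar a los fotógrafos"
)

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps({"titular": original})

# ensure_ascii=False keeps the characters literal; the resulting text
# then has to be written to disk with an encoding such as UTF-8.
literal = json.dumps({"titular": original}, ensure_ascii=False)

# Both forms decode back to exactly the same text.
assert json.loads(escaped)["titular"] == original
assert json.loads(literal)["titular"] == original
```

If the goal is a file that shows the accented characters directly, Scrapy 1.2 and later provide the FEED_EXPORT_ENCODING setting; putting FEED_EXPORT_ENCODING = 'utf-8' in the project's settings.py should make the JSON feed write literal characters instead of escapes. Whether that setting is available depends on the Scrapy version in use, which the question does not state.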
