
Encoding issue with Scrapy (Python)

I have created a CrawlSpider with Scrapy. I need to get a specific part of the page with an XPath:

item = ExplorerItem()
item['article'] = response.xpath("//div[@class='post-content']").extract()

Then I am using this item in pipelines.py.

But item['article'] gives me a result in unicode:

`u'<div class="post-content">\n\t\t\t\t\t<h2>D\xe9signation</h2>\n<p>`

I need to convert it to UTF-8.

The \xe9 and \xe7 sequences you are seeing are escaped Unicode characters: the value is a Python 2 unicode string, and its repr shows non-ASCII characters as escapes. Those characters are fine; I think your console just isn't set to render them. If you need plain ASCII, you may have some luck with the Unidecode module, which I have used before with success. Web pages and source data don't always tell the truth about their encoding, and often the data is a jumble of encodings. Unidecode will do its best to represent each character in ASCII.


from unidecode import unidecode
unidecode(u"\u5317\u4EB0")  # Note the u before the string on this line stands for unicode
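If the goal is UTF-8 rather than ASCII, a minimal sketch of the point above might look like the following. It assumes the scraped value is a unicode string like the one in the question; the variable names are made up for illustration. Note that extract() returns a list of strings, so each element would need to be encoded separately.

```python
# The \xe9 escape only appears in the repr; the string holds a single character.
text = u"<h2>D\xe9signation</h2>"  # stand-in for one element of item['article']
assert len(u"\xe9") == 1

# Encode to UTF-8 bytes when a byte string is needed, e.g. before writing to a file.
utf8_bytes = text.encode("utf-8")

# Decoding the bytes gives back the original text unchanged.
round_trip = utf8_bytes.decode("utf-8")
```

In UTF-8 the single character \xe9 becomes the two bytes \xc3\xa9, which is what a UTF-8 aware editor or terminal displays as an accented e.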

Set FEED_EXPORT_ENCODING = 'utf-8' in settings.py.

See docs here https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_ENCODING
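As a rough illustration of what this setting changes for JSON feeds, the sketch below uses the standard json module to mimic the two behaviours. It does not call Scrapy's exporter itself, and the item dict is a hypothetical stand-in for a scraped item.

```python
import json

item = {"article": u"<h2>D\xe9signation</h2>"}  # hypothetical scraped item

# Without the setting, JSON output escapes non-ASCII characters as \uXXXX.
escaped = json.dumps(item, ensure_ascii=True)

# With FEED_EXPORT_ENCODING = 'utf-8', the characters are kept as they are,
# which is comparable to ensure_ascii=False here.
readable = json.dumps(item, ensure_ascii=False)
```

Both forms decode to the same data; the setting only affects how readable the exported file is.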


 