I have created a CrawlSpider with Scrapy. I need to get a specific part of the page with an XPath:
item = ExplorerItem()
item['article'] = response.xpath("//div[@class='post-content']").extract()
I then use this item in pipelines.py, but item['article'] gives me a result in unicode:
`u'<div class="post-content">\n\t\t\t\t\t<h2>D\xe9signation</h2>\n<p>`
I need to convert it to UTF-8.
What you are seeing when you see \xe9 or \xe7 are escaped Unicode characters (\xe9 is é and \xe7 is ç). Those characters are fine; I think your console just isn't set to render them. If you need plain ASCII, you may have some luck with the Unidecode module, which I have used before with success. Web pages and source data don't always tell the truth about their encoding, and data is often a jumble of encodings. Unidecode will do its best to represent each character in ASCII.
from unidecode import unidecode
unidecode(u"\u5317\u4EB0") # Note the u before the string on this line stands for unicode
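If the goal is UTF-8 rather than ASCII, a minimal sketch of the conversion follows. It assumes Python 3 (the `u` prefix is kept only so the literal also reads correctly under Python 2), and the sample string is a shortened stand-in for the extracted HTML, not the real response.

```python
# The escaped form \xe9 is only how the character is written in source code
# or shown by repr(); the string itself holds a single e-acute character.
text = u"<h2>D\xe9signation</h2>"

# Encoding produces UTF-8 bytes; the e-acute becomes the two bytes 0xC3 0xA9.
utf8_bytes = text.encode("utf-8")

# Decoding those bytes gives back the original string unchanged.
round_trip = utf8_bytes.decode("utf-8")
```

Note that `extract()` returns a list of strings, so in the spider each element would need to be encoded separately.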
Set FEED_EXPORT_ENCODING = 'utf-8' in settings.py.
See docs here https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_ENCODING
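To see what this setting changes in a JSON feed, the difference can be approximated with the standard library. This is only an illustration, assuming Python 3. It is not Scrapy's exporter code, and the `item` dict is a made-up stand-in for a scraped item.

```python
import json

# Hypothetical stand-in for a scraped item.
item = {"article": u"D\u00e9signation"}

# ASCII-safe output: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# Readable output: characters are kept as-is and can be written as UTF-8.
readable = json.dumps(item, ensure_ascii=False)
readable_bytes = readable.encode("utf-8")
```

Both forms decode to the same data. The setting applies to feed exports, so a custom pipeline that writes its own files has to pick an encoding itself.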