Encoding issue with Scrapy (Python)

Question

I have created a CrawlSpider with Scrapy. I need to extract a specific part of the page with an XPath:

item = ExplorerItem()
item['article'] = response.xpath("//div[@class='post-content']").extract()

Then I am using this item in pipelines.py.

But item['article'] gives me a result in unicode:

`u'<div class="post-content">\n\t\t\t\t\t<h2>D\xe9signation</h2>\n<p>`

I need to convert it to UTF-8.

Answer 1

The \xe9 and \xe7 sequences you are seeing are escaped Unicode characters. Those characters are fine; I think your console just isn't set to render them. If you need plain ASCII, you may have some luck with the Unidecode module, which I have used before with success. Web pages and source data don't always tell the truth about their encoding, and often the data is a jumble of encodings. Unidecode will do its best to represent each character in ASCII.

from unidecode import unidecode
unidecode(u"\u5317\u4EB0")  # Note the u before the string on this line stands for unicode
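If the goal is UTF-8 bytes rather than an ASCII approximation, encoding the string directly may be enough. This is a minimal sketch, not the asker's code: the `article` list is a hypothetical stand-in for what `extract()` returns (a list of unicode strings), and it assumes Python 3 syntax.

```python
# Hypothetical stand-in for item['article']; extract() returns a list of strings.
article = [u'<h2>D\xe9signation</h2>']

# \xe9 is only how the interpreter displays the character U+00E9 in a repr.
# Encoding each string yields UTF-8 bytes, where that character becomes 0xC3 0xA9.
utf8_parts = [part.encode('utf-8') for part in article]
print(repr(utf8_parts[0]))

# Decoding the bytes gives back the original text unchanged.
round_trip = [part.decode('utf-8') for part in utf8_parts]
```

Nothing is lost in either direction here, unlike with Unidecode, which replaces non-ASCII characters with ASCII look-alikes.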

Answer 2

Set FEED_EXPORT_ENCODING='utf-8' in settings.py.

See docs here https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_ENCODING
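As a sketch, the relevant fragment of settings.py could look like this. It only affects feed exports (for example JSON output written by Scrapy itself), so it is an assumption that this is where the escaped output in the question comes from.

```python
# settings.py (fragment)
# Write exported feeds as UTF-8 instead of the default \uXXXX escapes
# that Scrapy uses for JSON output.
FEED_EXPORT_ENCODING = 'utf-8'
```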

2 answers

solution1
1 ACCEPTED 2017-11-28 12:52:43

solution2
0 2017-11-28 06:56:47
