
Scrapy - How to convert the unicode on my output?

I'm scraping a website whose titles contain accented Latin characters, e.g. É, não, etc.

This is my code:

    for tank in response.xpath('//html/body/div/div[4]/div/div/div/table[1]/tr/td/div'):
        item = VapeItem()
        item["title"] = tank.xpath("h3/a/text()").extract()

And an example of the JSON output:

{"title": "HALO CAF\u00c9 MOCHA"},

The question is: how do I convert this so that it shows up like this?

{"title": "HALO CAFÉ MOCHA"},

I've tried encode("utf8") without success.

You probably just need to print it?

>>> print json.loads(txt)['title']

HALO CAFÉ MOCHA

Writing to a file works just as well; I don't really see the problem here.

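>>> import io, json
>>> parsed_data = json.loads(r'{"title": "HALO CAF\u00c9 MOCHA"}')
>>> # Let the file object do the UTF-8 encoding; this works on Python 2 and 3
>>> with io.open('foo.txt', 'w', encoding='utf-8') as fout:
...   fout.write(parsed_data['title'])
... 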

You've got it backwards. You need to decode as UTF-8 (to convert bytes-like str data to unicode), not encode.

But that's not the real problem: json.dump ensures ASCII-compatible output by default (using escapes) to avoid problems with protocols that only handle ASCII (or can't rely on a specific encoding besides "ASCII compatible").

Pass ensure_ascii=False to the dump / dumps call to allow it to output non-ASCII characters. Note the warnings in the docs: on Python 2 this can make some calls return str and others unicode, which may cause problems (on Py3 the issue isn't there; the result is always str).
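
To make the difference concrete, here is a minimal sketch for Python 3 that serializes the same item both ways. The item dict and the output file name items.json are made up for illustration; only json.dumps and its ensure_ascii parameter come from the answer above.

```python
import io
import json

# Hypothetical item, mirroring the title from the question
item = {"title": u"HALO CAF\u00c9 MOCHA"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(item)

# With ensure_ascii=False the accented character is emitted as-is
readable = json.dumps(item, ensure_ascii=False)

# Both strings are valid JSON and decode to the same data
assert json.loads(escaped) == json.loads(readable) == item

# When writing the readable form to disk, pick the encoding explicitly
# ("items.json" is an assumed file name)
with io.open("items.json", "w", encoding="utf-8") as out:
    out.write(readable)
```

Note that the escaped form is not corrupted data: any JSON parser turns \u00c9 back into É, so the choice only affects how the file looks to a human reader. If the JSON is produced by Scrapy's feed exports rather than by your own json.dumps call, newer Scrapy releases have a FEED_EXPORT_ENCODING setting that is worth checking in the documentation for your version.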


 