

Scrapy output feed international unicode characters (e.g. Japanese chars)

I'm a newbie to Python and Scrapy and I'm following the dmoz tutorial. As a minor variant to the tutorial's suggested start URL, I chose a Japanese category from the dmoz sample site and noticed that the feed export I eventually get shows the Unicode escape sequences instead of the actual Japanese characters.

It seems like I need to use TextResponse somehow, but I'm not sure how to make my spider use that object instead of the base Response object.

  1. How should I modify my code to show the Japanese chars in my output?
  2. How do I get rid of the square brackets, the single quotes, and the 'u' that's wrapping my output values?

Ultimately, I want to have an output of, say,

オンラインショップ (these are Japanese chars)

instead of the current output of

[u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7'] (the unicodes)

If you look at my screenshot, it corresponds to cell C7, one of the text titles.

Here's my spider (identical to the one in the tutorial, except for a different start_url):

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from dmoz.items import DmozItem

class DmozSpider(BaseSpider):
   name = "dmoz.org"
   allowed_domains = ["dmoz.org"]
   start_urls = [
       # URL of the Japanese dmoz category (missing from the source)
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//ul/li')
       items = []
       for site in sites:
           item = DmozItem()
           item['title'] = site.select('a/text()').extract()
           item['link'] = site.select('a/@href').extract()
           item['desc'] = site.select('text()').extract()
           items.append(item)
       return items

settings.py:

FEED_URI = 'items.csv'

Output screenshot: http://i55.tinypic.com/eplwlj.png (sorry, I don't have enough SO points yet to post images)

When you scrape the text from the page it is stored in Unicode.

What you want to do is encode it into something like UTF-8.
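As a minimal sketch of that encoding step (the variable names are made up for illustration, and the string is the title from the question written with its escape sequences), something like this should work:

```python
# -*- coding: utf-8 -*-
# The scraped title as a Unicode string (same code points as in the question).
title = u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7'

# Encode the Unicode string into UTF-8 bytes before writing it to the feed.
encoded = title.encode('utf-8')

# Decoding the bytes again gives back the original Unicode string.
round_trip = encoded.decode('utf-8')
```

Each of these nine characters takes three bytes in UTF-8, so `encoded` is 27 bytes long.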


Also, when you extract the text using your selector, it is stored in a list even if there is only one result, so you need to pick the first element.
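Putting the two points together, a rough sketch could look like the following. The helper name `first_utf8` is hypothetical (it is not part of Scrapy), and a plain list stands in for what `extract()` returns so the snippet runs on its own:

```python
# -*- coding: utf-8 -*-

def first_utf8(values, default=u''):
    """Return the first extracted string encoded as UTF-8.

    Falls back to the encoded default when the list is empty, so a
    missing node does not raise an IndexError.
    """
    text = values[0] if values else default
    return text.encode('utf-8')


# Stand-in for the list returned by site.select('a/text()').extract()
extracted = [u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7']

title = first_utf8(extracted)   # UTF-8 bytes of the first (only) element
missing = first_utf8([])        # empty bytes when nothing was extracted
```

In the spider from the question this would presumably be used as `item['title'] = first_utf8(site.select('a/text()').extract())`, and likewise for `link` and `desc`. That should remove the square brackets and the `u'...'` wrapper from the exported values, since each field then holds a single encoded string instead of a list.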

