Scrapy output feed international unicode characters (e.g. Japanese chars)

I'm a newbie to Python and Scrapy and I'm following the dmoz tutorial. As a minor variant to the tutorial's suggested start URL, I chose a Japanese category from the dmoz sample site and noticed that the feed export I eventually get shows the Unicode numeric values instead of the actual Japanese characters.

It seems like I need to use TextResponse somehow, but I'm not sure how to make my spider use that object instead of the base Response object.

  1. How should I modify my code to show the Japanese chars in my output?
  2. How do I get rid of the square brackets, the single quotes, and the 'u' that's wrapping my output values?

Ultimately, I want to have an output of, say,

オンラインショップ (these are Japanese chars)

instead of the current output of

[u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7'] (the unicodes)

If you look at my screenshot, it corresponds to cell C7, one of the text titles.

Here's my spider (identical to the one in the tutorial, except for a different start URL):

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from dmoz.items import DmozItem

class DmozSpider(BaseSpider):
   name = "dmoz.org"
   allowed_domains = ["dmoz.org"]
   start_urls = [
       "http://www.dmoz.org/World/Japanese/"
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//ul/li')
       items = []
       for site in sites:
           item = DmozItem()
           item['title'] = site.select('a/text()').extract()
           item['link'] = site.select('a/@href').extract()
           item['desc'] = site.select('text()').extract()
           items.append(item)
       return items

settings.py:

FEED_URI = 'items.csv'
FEED_FORMAT = 'csv'

Output screenshot: http://i55.tinypic.com/eplwlj.png (sorry, I don't have enough SO points yet to post images)

When you scrape the text from the page, it is stored as a Unicode string.

What you want to do is encode it into something like UTF-8.

unicode_string.encode('utf-8')
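
As a rough illustration of what that call does, the sketch below encodes the title from the question and decodes it back again. The variable names are made up for the example; only `encode` and `decode` come from Python itself.

```python
# -*- coding: utf-8 -*-
# The title from the question, written with escapes so the
# encoding of this source file does not matter.
title = u"\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7"

# encode() turns the text into UTF-8 bytes, the form that gets
# written to a file such as a CSV feed.
encoded = title.encode("utf-8")

# decode() reverses the step and gives back the original text.
round_tripped = encoded.decode("utf-8")

# Each of these nine katakana characters takes three bytes in UTF-8.
char_count = len(title)
byte_count = len(encoded)
```

A program that opens the resulting file as UTF-8 should then show the characters themselves rather than the escapes.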

Also, when you extract the text using your selector, it is stored in a list even if there is only one result, so you need to pick the first element.
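
A minimal sketch of that last step, with no Scrapy dependency so it can be run on its own. The helper `first_value` and the sample list are hypothetical stand-ins for what `extract()` returned in the question; the whitespace stripping and the empty-string default are assumptions, not part of the original answer.

```python
# -*- coding: utf-8 -*-
def first_value(extracted, default=u""):
    """Return the first item of an extract()-style list, or a default if it is empty."""
    # Stripping whitespace is an extra convenience assumed for this sketch.
    return extracted[0].strip() if extracted else default

# Stand-in for the list that site.select('a/text()').extract() produced.
extracted_title = [u"\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7"]

title = first_value(extracted_title)   # plain text, no list wrapper
title_bytes = title.encode("utf-8")    # UTF-8 bytes for the output
missing = first_value([])              # no match falls back to the default
```

In the spider, the same idea would apply to each field before it is assigned to the item, which should remove the square brackets and quotes from the exported values.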
