How can I avoid JSON percent-encoding and \u-escaping?
When I parse the file
<html>
<head><meta charset="UTF-8"></head>
<body><a href="Düsseldorf.html">Düsseldorf</a></body>
</html>
using
item = SimpleItem()
item['name'] = response.xpath('//a/text()')[0].extract()
item["url"] = response.xpath('//a/@href')[0].extract()
return item
I end up with either \u escapes
[{
"name": "D\u00fcsseldorf",
"url": "D\u00fcsseldorf.html"
}]
or with a percent-encoded string
D%C3%BCsseldorf
Using a custom item exporter
# -*- coding: utf-8 -*-
import json

from scrapy.contrib.exporter import BaseItemExporter


class UnicodeJsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        # ensure_ascii=False keeps non-ASCII characters instead of \uXXXX escapes
        self.encoder = json.JSONEncoder(ensure_ascii=False, **kwargs)

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        # Write one JSON document per line
        self.file.write(self.encoder.encode(itemdict) + '\n')
and the appropriate feed exporter setting
FEED_EXPORTERS = {
    'json': 'myproj.exporter.UnicodeJsonLinesItemExporter',
}
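does not help.
How do I get UTF-8 encoded JSON output?
I am reiterating and expanding on an unanswered question.
Update:
Orthogonal to Scrapy, note that without setting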
export PYTHONIOENCODING="utf_8"
running
> echo { \"name\": \"Düsseldorf\", \"url\": \"Düsseldorf.html\" } > dorf.json
> python -c'import fileinput, json;print json.dumps(json.loads("".join(fileinput.input())),sort_keys=True, indent=4, ensure_ascii=False)' dorf.json > dorf_pp.json
will fail with
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)
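The traceback comes from the output encoding rather than from JSON itself: when stdout is redirected, Python 2 falls back to the ASCII codec unless PYTHONIOENCODING says otherwise. A minimal sketch that reproduces the same failure under Python 3 (an assumption, since the commands above are Python 2) by pushing text through an in-memory stream with an explicit codec. The helper name `write_text` is hypothetical.

```python
import io


def write_text(text, encoding):
    """Push text through a text stream using the given codec; return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)   # raises UnicodeEncodeError if the codec cannot represent text
    stream.flush()
    return raw.getvalue()


# An ASCII-only stream cannot represent the u-umlaut ...
try:
    write_text("D\u00fcsseldorf", "ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# ... while a UTF-8 stream encodes it as two bytes.
utf8_bytes = write_text("D\u00fcsseldorf", "utf-8")

print(type(ascii_error).__name__)
print(utf8_bytes)
```

Setting PYTHONIOENCODING has the same effect as choosing the codec for the stream explicitly, only for the interpreter's standard streams.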
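Update
As posed, my question could not be answered. The UnicodeJsonLinesItemExporter works, but another part of the pipeline was the culprit: as a post-processing step to pretty-print the JSON output, I was using
python -m json.tool in.json > out.json
which by default writes non-ASCII characters back out as \u escapes.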
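One possible replacement for that post-processing step, sketched for Python 3 (the commands in the question are Python 2): load the JSON and dump it again with ensure_ascii=False, writing through a file opened with an explicit UTF-8 encoding. The function name `pretty_print_json` and the temporary file paths are hypothetical.

```python
import io
import json
import os
import tempfile


def pretty_print_json(src_path, dst_path):
    """Re-indent a JSON file without escaping non-ASCII characters."""
    with io.open(src_path, encoding="utf-8") as src:
        data = json.load(src)
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(json.dumps(data, indent=4, sort_keys=True, ensure_ascii=False))
        dst.write("\n")


# Usage: round-trip a small \u-escaped document through a temporary directory.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "in.json")
out_path = os.path.join(workdir, "out.json")
with io.open(in_path, "w", encoding="utf-8") as f:
    f.write('[{"name": "D\\u00fcsseldorf", "url": "D\\u00fcsseldorf.html"}]')

pretty_print_json(in_path, out_path)

with io.open(out_path, encoding="utf-8") as f:
    pretty = f.read()
print(pretty)
```

If I recall correctly, json.tool in Python 3.9 and later also accepts a --no-ensure-ascii flag, which would make a helper like this unnecessary.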
>>> import json
>>> a = [{
...     "name": "D\u00fcsseldorf",
...     "url": "D\u00fcsseldorf.html"
... }]
>>> a
[{'url': 'Düsseldorf.html', 'name': 'Düsseldorf'}]
>>> json.dumps(a, ensure_ascii=False)
'[{"url": "Düsseldorf.html", "name": "Düsseldorf"}]'
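To make the difference explicit, the sketch below (Python 3 assumed) serializes the same data both ways and checks that the two strings decode to identical objects. Only the textual representation differs, so either form is valid JSON.

```python
import json

data = [{"name": "D\u00fcsseldorf", "url": "D\u00fcsseldorf.html"}]

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keeps the u-umlaut as-is

print(escaped)
print(readable)

# Both strings parse back to the same Python objects.
same = json.loads(escaped) == json.loads(readable)

# Encoding the readable form yields the UTF-8 bytes that end up in the file.
utf8_payload = readable.encode("utf-8")
```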
This seems to work for me
# -*- coding: utf-8 -*-
# Python 2 code: urllib.unquote operates on byte strings here.
import scrapy
import urllib


class SimpleItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()


class CitiesSpider(scrapy.Spider):
    name = "cities"
    allowed_domains = ["sistercity.info"]
    start_urls = (
        'http://en.sistercity.info/countries/de.html',
    )

    def parse(self, response):
        for a in response.css('a'):
            href = a.css('::attr(href)').extract_first()
            if href is None:
                # Skip anchors without an href attribute
                continue
            item = SimpleItem()
            item['name'] = a.css('::text').extract_first()
            # Undo the percent-encoding, then decode the UTF-8 bytes
            item['url'] = urllib.unquote(
                href.encode('ascii')
            ).decode('utf8')
            yield item
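The spider above is written for Python 2, where urllib.unquote works on byte strings. Under Python 3 (an assumption, not part of the original answer) the equivalent function lives in urllib.parse and decodes as UTF-8 by default, so the encode/decode steps are not needed. `decode_href` is a hypothetical helper name.

```python
from urllib.parse import quote, unquote


def decode_href(href):
    """Turn a percent-encoded href back into readable text (UTF-8 assumed)."""
    return unquote(href, encoding="utf-8")


decoded = decode_href("D%C3%BCsseldorf.html")
print(decoded)

# quote() goes the other way and restores the percent-encoded form.
reencoded = quote(decoded)
print(reencoded)
```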
Using the feed exporter quoted in your question, it also worked with a different storage
# -*- coding: utf-8 -*-
import json
import io
import os

from scrapy.contrib.exporter import BaseItemExporter
from w3lib.url import file_uri_to_path


class CustomFileFeedStorage(object):

    def __init__(self, uri):
        self.path = file_uri_to_path(uri)

    def open(self, spider):
        # Create the target directory if it does not exist yet
        dirname = os.path.dirname(self.path)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
        return io.open(self.path, mode='ab')

    def store(self, file):
        file.close()


class UnicodeJsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(ensure_ascii=False, **kwargs)

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict) + '\n')
(uncomment the FEED_STORAGES lines if necessary)
FEED_EXPORTERS = {
    'json': 'myproj.exporter.UnicodeJsonLinesItemExporter'
}
#FEED_STORAGES = {
#    '': 'myproj.exporter.CustomFileFeedStorage'
#}
FEED_FORMAT = 'json'
FEED_URI = "out.json"
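For readers on a newer Scrapy release: as far as I know, Scrapy 1.2 and later provide a FEED_EXPORT_ENCODING setting that makes the built-in JSON exporters write UTF-8 directly, so a custom exporter may not be needed at all. This is not part of the original answer, only a settings sketch under that assumption.

```python
# settings.py (sketch; assumes Scrapy 1.2 or later)
FEED_EXPORT_ENCODING = 'utf-8'  # write non-ASCII characters as UTF-8, not \uXXXX
FEED_FORMAT = 'json'
FEED_URI = 'out.json'
```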