
Python Load UTF-8 JSON

I have the following JSON (for simplicity's sake I'll only use one entry, but there are 100 in reality):

{
    "Active": false, 
    "Book": "US Derivat. London, Mike Übersax/Michael Jealous", 
    "ExpirationDate": "2006-10-12", 
    "Isin": "CH0013096497", 
    "IssueDate": "2001-10-09", 
    "KbForXMonths": "0", 
    "KbPeriodDay": "Period", 
    "KbType": "Prozent", 
    "KbYear": "0.5", 
    "Keyinvest_IssueRetro": "0.50%", 
    "Keyinvest_RecurringRetro": "1.00% pro rata temporis", 
    "Keyinvest_RetroPayment": "Every month", 
    "LastImportDate": "2008-12-31", 
    "LiberierungDate": "1900-01-01", 
    "NominalCcy": "USD", 
    "NominalStueck": "5,000", 
    "PrimaryCCR": "0", 
    "QuoteType": "Nominal", 
    "RealValor": "0", 
    "Remarks": "", 
    "RwbeProductId_CCR": "034900", 
    "RwbeProductId_EFS": "034900", 
    "SecName": "Cliquet GROI on Nasdaq", 
    "SecType": "EQ", 
    "SubscriptionEndDate": "1900-01-01", 
    "TerminationDate": "2003-10-19", 
    "TradingCcy": "USD", 
    "Valor": 1309649
}

I'm trying to read this JSON in order to save it as a .csv (so that I can import it into a database).

However, when I try to write this JSON data as a CSV like so:

import csv
import codecs

with codecs.open('EFSDUMP.csv', 'w', 'utf-8-sig') as csv_file:
    content_writer = csv.writer(csv_file, delimiter=',')
    content_writer.writerow(data.values())

I get an error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xdc' in position 25: ordinal not in range(128)

That is because there's an umlaut in the JSON (see the "Book" attribute).

I try to read the JSON like this:

data = json.loads(open('EFSDUMP.json').read().decode('utf-8-sig'))
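(A slightly cleaner way to do this read, sketched here with a small demo file so it is self-contained: the 'utf-8-sig' codec decodes UTF-8 and silently strips a BOM if one happens to be present, so it is safe whether or not Notepad++ is right about the BOM. This works the same on Python 2 and 3.)

```python
import codecs
import io
import json

# create a small sample file for demonstration only: UTF-8 with a BOM
with open('EFSDUMP.json', 'wb') as f:
    f.write(codecs.BOM_UTF8)
    f.write(json.dumps({'Book': u'Mike \xdcbersax'}, ensure_ascii=False).encode('utf-8'))

# 'utf-8-sig' decodes UTF-8 and drops the BOM if there is one
with io.open('EFSDUMP.json', encoding='utf-8-sig') as f:
    data = json.load(f)
```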

What's interesting is that this:

print data

Gives me this:

{u'PrimaryCCR': u'0', u'SecType': u'EQ', u'Valor': 1309649, u'KbType': u'Prozent', u'Book': u'US Derivat. London, Mike \xdcbersax/Michael Jealous', u'Keyinvest_RecurringRetro': u'1.00% pro rata temporis', u'TerminationDate': u'2003-10-19', u'RwbeProductId_CCR': u'034900', u'SubscriptionEndDate': u'1900-01-01', u'ExpirationDate': u'2006-10-12', u'Keyinvest_RetroPayment': u'Every month', u'Keyinvest_IssueRetro': u'0.50%', u'QuoteType': u'Nominal', u'KbYear': u'0.5', u'LastImportDate': u'2008-12-31', u'Remarks': u'', u'RealValor': u'0', u'SecName': u'Cliquet GROI on Nasdaq', u'Active': False, u'KbPeriodDay': u'Period', u'Isin': u'CH0013096497', u'LiberierungDate': u'1900-01-01', u'IssueDate': u'2001-10-09', u'KbForXMonths': u'0', u'NominalCcy': u'USD', u'RwbeProductId_EFS': u'034900', u'TradingCcy': u'USD', u'NominalStueck': u'5,000'}

Clearly the umlaut became a '\xdc'.
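(Actually, nothing was lost here: printing a dict shows the repr of each value, and in Python 2 the repr of a unicode string spells non-ASCII characters as \xNN escapes. u'\xdc' and u'Ü' are the very same code point, as a quick check shows:)

```python
# u'\xdc' is just the escaped spelling of U+00DC
d = {u'Book': u'\xdc'}
print(d)           # Python 2 shows {u'Book': u'\xdc'}; the data is unchanged
print(d[u'Book'])  # prints the actual character
```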

However, when I do this:

print data['Book']

meaning I access the attribute directly, I get:

US Derivat. London, Mike Übersax/Michael Jealous

So the umlaut is an actual umlaut again.

I'm pretty sure that the JSON is UTF-8 without BOM (Notepad++ claims so).

I have already tried all of the suggestions here without any success: Python load json file with UTF-8 BOM header

How can I properly read the UTF-8 JSON file so that I can write it as a .csv?

Any help is greatly appreciated.

Python version: 2.7.2

In Python 2, the csv module does not support writing Unicode. You need to encode it manually here; otherwise your Unicode values are encoded for you using ASCII (which is why you got the encoding exception).

This also means you need to write the UTF-8 BOM manually, but only if you really need it. UTF-8 can only be written one way; a Byte Order Mark is not needed to read UTF-8 files. Microsoft likes to add it to files to make detecting file encodings easier for their tools, but the UTF-8 BOM may actually make it harder for other tools to work correctly, as they won't ignore the extra initial character.

Use:

import csv
import codecs

with open('EFSDUMP.csv', 'wb') as csv_file:
    csv_file.write(codecs.BOM_UTF8)
    content_writer = csv.writer(csv_file)
    content_writer.writerow([unicode(v).encode('utf8') for v in data.values()])

Note that this'll write your values in arbitrary (dictionary) order. The unicode() call converts non-string types to unicode strings before encoding.
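(For a stable column order, sort the keys and write them as a header row. As a side note for anyone reading this on Python 3 later: there the csv module handles Unicode text directly, so the whole round trip collapses to a sketch like the following, with newline='' being the documented way to open CSV files and 'utf-8-sig' adding the BOM only if you want Excel-friendliness. The sample dict stands in for the loaded JSON data.)

```python
import csv
import json

# sample data standing in for the loaded JSON
data = {'Book': 'Mike Übersax', 'Valor': 1309649, 'Active': False}

# sorted keys give a deterministic column order; write them as a header row
fieldnames = sorted(data)
with open('EFSDUMP.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(fieldnames)
    writer.writerow([data[k] for k in fieldnames])
```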

To be explicit: you loaded the JSON data just fine. It is the CSV writing that failed for you.

