
PyYaml - Dump unicode with special characters (i.e. accents)

I'm working with yaml files that have to be human readable and editable but that will also be edited from Python code. I'm using Python 2.7.3.

The file needs to handle accents (mostly to handle text in French).

Here is a sample of my issue:

import codecs
import yaml

file = r'toto.txt'

f = codecs.open(file,"w",encoding="utf-8")

text = u'héhéhé, hûhûhû'

textDict = {"data": text}

f.write( 'write unicode     : ' + text + '\n' )
f.write( 'write dict        : ' + unicode(textDict) + '\n' )
f.write( 'yaml dump unicode : ' + yaml.dump(text))
f.write( 'yaml dump dict    : ' + yaml.dump(textDict))
f.write( 'yaml safe unicode : ' + yaml.safe_dump(text))
f.write( 'yaml safe dict    : ' + yaml.safe_dump(textDict))

f.close()

The written file contains:

write unicode     : héhéhé, hûhûhû
write dict        : {'data': u'h\xe9h\xe9h\xe9, h\xfbh\xfbh\xfb\n'}

yaml dump unicode : "h\xE9h\xE9h\xE9, h\xFBh\xFBh\xFB"
yaml dump dict    : {data: "h\xE9h\xE9h\xE9, h\xFBh\xFBh\xFB"}

yaml safe unicode : "h\xE9h\xE9h\xE9, h\xFBh\xFBh\xFB"
yaml safe dict    : {data: "h\xE9h\xE9h\xE9, h\xFBh\xFBh\xFB"}

The yaml dump works perfectly for loading with yaml, but it is not human readable.

As you can see in the example code, the result is the same when I try to write a unicode representation of a dict (I don't know if it is related or not).

I'd like the dump to contain the text with accents, not the unicode escape codes. Is that possible?

yaml can dump unicode characters as-is if you pass the allow_unicode=True keyword argument to any of the dumpers. If you don't provide a file, dump() returns a UTF-8 encoded string (i.e. the result of getvalue() on the StringIO() instance that is created to hold the dumped data), and you have to decode that from UTF-8 to unicode before appending it to your unicode string:

# coding: utf-8

import codecs
import ruamel.yaml as yaml

file_name = r'toto.txt'

text = u'héhéhé, hûhûhû'

textDict = {"data": text}

with open(file_name, 'w') as fp:
    yaml.dump(textDict, stream=fp, allow_unicode=True)

print('yaml dump dict 1   : ' + open(file_name).read()),

f = codecs.open(file_name,"w",encoding="utf-8")
f.write('yaml dump dict 2   : ' + yaml.dump(textDict, allow_unicode=True).decode('utf-8'))
f.close()
print(open(file_name).read())

Output:

yaml dump dict 1    : {data: 'héhéhé, hûhûhû'}
yaml dump dict 2    : {data: 'héhéhé, hûhûhû'}

I tested this with my enhanced version of PyYAML (ruamel.yaml), but this should work the same in PyYAML itself.
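
For readers on Python 3, the same idea can be sketched roughly as follows. This is only an illustration, not the code from the answer above: it assumes plain PyYAML rather than ruamel.yaml, the name text_dict and the in-memory buffer (standing in for a file opened with encoding="utf-8") are made up for the example, and the exact layout of the dumped text may differ between PyYAML versions.

```python
import io

import yaml  # PyYAML (assumed here; the answer above used ruamel.yaml)

text_dict = {"data": "héhéhé, hûhûhû"}  # hypothetical sample data

# Default behaviour: non-ASCII characters are written as escape sequences.
escaped = yaml.safe_dump(text_dict)

# allow_unicode=True keeps the accented characters readable.
readable = yaml.safe_dump(text_dict, allow_unicode=True)

# Dumping into a text stream. The in-memory buffer stands in for a file
# opened with open(path, "w", encoding="utf-8").
buffer = io.StringIO()
yaml.safe_dump(text_dict, buffer, allow_unicode=True)

# Passing an explicit encoding makes dump return bytes, which must be
# decoded before being concatenated with a str (the Python 3 counterpart
# of the .decode('utf-8') call in the Python 2 code above).
as_bytes = yaml.safe_dump(text_dict, allow_unicode=True, encoding="utf-8")

print(escaped, end="")
print(readable, end="")
print(as_bytes.decode("utf-8"), end="")
```

Both the escaped and the readable form load back to the same dictionary, so the choice only affects how the file looks to a human editor.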

Update (2020)

Nowadays, PyYaml handles unicode easily with Python 3, but this still requires the allow_unicode=True argument:

import yaml
d = {'a': 'héhéhé', 'b': 'hühühü'}
yaml_code = yaml.dump(d, allow_unicode=True, sort_keys=False)
print(yaml_code)

Will result in:

a: héhéhé
b: hühühü

Note: The sort_keys=False argument should be used as of Python 3.6 to leave the order of the dictionary keys unaltered. PyYaml has traditionally sorted keys, because Python dictionaries did not have a definite order. Even though dictionaries have preserved insertion order since Python 3.6 (and officially since 3.7), PyYaml still sorts keys by default.
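
The effect of the key ordering can be seen with a small sketch like the one below. The dictionary is a made-up example whose keys are deliberately inserted out of alphabetical order, and it assumes a PyYAML release that accepts the sort_keys argument (to my knowledge, 5.1 or later).

```python
import yaml  # PyYAML; sort_keys is assumed to be available (5.1 or later)

# Hypothetical data: 'b' is inserted before 'a' on purpose.
d = {"b": "hühühü", "a": "héhéhé"}

# Default: keys are sorted alphabetically when dumping.
sorted_yaml = yaml.safe_dump(d, allow_unicode=True)

# sort_keys=False: keys are written in insertion order.
ordered_yaml = yaml.safe_dump(d, allow_unicode=True, sort_keys=False)

print(sorted_yaml)
print(ordered_yaml)
```

Loading ordered_yaml back with yaml.safe_load gives a dictionary whose keys are again in the order b, a.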
