Pyparsing reading unicode characters from file
I'd like to read some values from a sample.cfg file and parse them. The code looks like this:
from pyparsing import *
key = Word(alphanums)('key')
equals = Suppress('=')
value = Word(alphanums)('value')
kvexpression = key + equals + value
with open('sample.cfg') as config_in:
    config_data = config_in.read()
for match in kvexpression.scanString(config_data):
    result = match[0]
    print("{0} is {1}".format(result.key, result.value))
If I use ASCII characters it works fine, like this:
sample.cfg
city=Atlanta
state=Georgia
population=5522942
But if I use some unicode characters in the input file, it doesn't work as expected.
sample.cfg (with unicode letters)
şehir=İzmir
ülke=Türkiye
nüfus=4279677
If you run this program, its output is like this:
lke is T
fus is 4279677
As you can see, it ignores unicode characters.
Update:
I altered the code as suggested. Now it looks like this:
from pyparsing import *
key = Word(alphanums + alphas8bit)('key')
equals = Suppress('=')
value = Word(alphanums + alphas8bit)('value')
kvexpression = key + equals + value
with open('şehir.cfg') as config_in:
    config_data = config_in.read()
for match in kvexpression.scanString(config_data):
    result = match[0]
    print("{0} is {1}".format(result.key, result.value))
And minor changes in the data file:
sample.cfg
şehir=İzmir
ülke=Türkiye
nüfus=4279677
alfabe=AaBbCcÇçDdEeFfGgĞğHhIiİiJjKkLlMmNnOoÖöPpRrSsŞşTtUuÜüVvYyZz
When I run the program, its output is like this:
ülke is Türkiye
nüfus is 4279677
alfabe is AaBbCcÇçDdEeFfGg
As you can see, the first line, which starts with the accented s 'ş', is not displayed. I noticed this situation before.
Almost there, yet not quite.
I use a Linux box.
Replace alphanums with alphanums+alphas8bit in two places in your code, as in this line.
key = Word(alphanums+alphas8bit)('key')
The problem is that alphanums matches only the unaccented Latin alphabet (plus the numerical digits). alphas8bit matches the additional 8-bit characters in Latin-1.
When I run the altered code against this input,
sehir=Izmir
ülke=Türkiye
nüfus=4279677
AaBbCcÇçDdEeFfGgGgHhIiIiJjKkLlMmNnOoÖöPpRrSsSsTtUuÜüVvYyZz = 5
where the entire Turkish alphabet appears in the last line, the result is,
sehir is Izmir
ülke is Türkiye
nüfus is 4279677
AaBbCcÇçDdEeFfGgGgHhIiIiJjKkLlMmNnOoÖöPpRrSsSsTtUuÜüVvYyZz is 5
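Note that the input above has s, I and g where the original file had ş, İ and ğ. A likely reason is that those letters sit outside Latin-1, so alphas8bit cannot cover them. The following is only a small sketch (plain Python 3, no pyparsing calls; the helper name in_latin1 is made up for illustration) that checks which Turkish letters fit in 8 bits:

```python
# Turkish letters that are not part of the plain ASCII alphabet.
turkish_letters = u"çÇğĞıİöÖşŞüÜ"

def in_latin1(ch):
    """Return True if the character's code point fits in 8 bits (Latin-1)."""
    return ord(ch) <= 0xFF

# Split the letters into those Latin-1 can represent and those it cannot.
inside = [ch for ch in turkish_letters if in_latin1(ch)]
outside = [ch for ch in turkish_letters if not in_latin1(ch)]

print("in Latin-1:     ", u" ".join(inside))
print("outside Latin-1:", u" ".join(outside))
```

The letters reported as outside Latin-1 are the ones at which a Word built from alphanums+alphas8bit would stop matching, which is consistent with the missing şehir line and the alfabe value being cut off before Ğ.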
I've found a solution by myself. I don't know whether it is the proper way to achieve this, but it looks fine to me.
from pyparsing import *
alphanums_tr = u'abcçdefgğhiijklmnoöprsştuüvyzABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ0123456789'
key = Word(alphanums_tr)('key')
equals = Suppress('=')
value = Word(alphanums_tr)('value')
kvexpression = key + equals + value
with open('şehir.cfg') as config_in:
    config_data = config_in.read()
for match in kvexpression.scanString(config_data):
    result = match[0]
    print("{0} is {1}".format(result.key, result.value))
The program's output is like this:
şehir is İzmir
ülke is Türkiye
nüfus is 4279677
alfabe is AaBbCcÇçDdEeFfGgĞğHhIiİiJjKkLlMmNnOoÖöPpRrSsŞşTtUuÜüVvYyZz
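As an alternative to typing the alphabet by hand, the character set could probably be derived from Unicode properties instead. This is only a sketch, assuming Python 3; the helper name unicode_alphanums and the 0x250 cut-off are my own choices for illustration, not anything pyparsing provides. The resulting string can then be passed to Word(...) in the same way as alphanums_tr.

```python
def unicode_alphanums(limit=0x250):
    """Return every alphanumeric character below `limit` as one string.

    The default 0x250 is an assumed cut-off: it covers Basic Latin,
    Latin-1, Latin Extended-A and Latin Extended-B.
    """
    return u"".join(chr(cp) for cp in range(limit) if chr(cp).isalnum())

chars = unicode_alphanums()

# Check that one line of the sample data is fully covered by the set.
sample = u"şehir=İzmir"
key, _, value = sample.partition(u"=")
covered = all(ch in chars for ch in key + value)

print(len(chars), "characters in the set")
print("sample line covered:", covered)
```

Be aware that str.isalnum() is broader than "letters and digits": it also accepts a few characters such as superscript digits, so the hand-written alphabet is stricter if that matters. It may also be worth opening the file with an explicit encoding, e.g. open('şehir.cfg', encoding='utf-8'), so the result does not depend on the system locale.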