Pyparsing reading Unicode characters from file

I'd like to read some values from a sample.cfg file and parse them. The code looks like this:

from pyparsing import *

key = Word(alphanums)('key')
equals = Suppress('=')
value = Word(alphanums)('value')

kvexpression = key + equals + value

with open('sample.cfg') as config_in:
  config_data = config_in.read()

for match in kvexpression.scanString(config_data):
  result = match[0]
  print("{0} is {1}".format(result.key, result.value))

If I use ASCII characters, it works fine. Like this:

sample.cfg

city=Atlanta
state=Georgia
population=5522942

But if I use some Unicode characters in the input file, it doesn't work as expected.

sample.cfg (with Unicode letters)

şehir=İzmir
ülke=Türkiye
nüfus=4279677

If you run this program, its output is like this:

lke is T
fus is 4279677

As you can see, it ignores the Unicode characters: the first line is skipped entirely, and the other keys and values lose their accented letters.

Update:

I altered the code as suggested. Now it looks like this:

from pyparsing import *

key = Word(alphanums + alphas8bit)('key')
equals = Suppress('=')
value = Word(alphanums + alphas8bit)('value')

kvexpression = key + equals + value

with open('şehir.cfg') as config_in:
  config_data = config_in.read()

for match in kvexpression.scanString(config_data):
  result = match[0]
  print("{0} is {1}".format(result.key, result.value))

And I made minor changes to the data file:

sample.cfg

şehir=İzmir
ülke=Türkiye
nüfus=4279677
alfabe=AaBbCcÇçDdEeFfGgĞğHhIiİiJjKkLlMmNnOoÖöPpRrSsŞşTtUuÜüVvYyZz

When I run the program, its output is like this:

ülke is Türkiye
nüfus is 4279677
alfabe is AaBbCcÇçDdEeFfGg

As you can see, the first line, which starts with the accented s 'ş', is not displayed, and the alphabet line is cut off just before 'Ğ'. I noticed this situation before.

Almost there, yet not quite.

I use a Linux box.

Replace alphanums with alphanums+alphas8bit in two places in your code, as in this line:

key = Word(alphanums+alphas8bit)('key')

The problem is that alphanums matches only the unaccented Latin alphabet (plus the numerical digits). alphas8bit matches the additional 8-bit characters in Latin-1.
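Latin-1 contains only some of the Turkish letters, which is a likely reason why the output in the question's update skips the line starting with 'ş' and stops at 'Gg' before 'Ğ'. The following is a minimal standard-library sketch (it does not use pyparsing) that checks which Turkish-specific letters fit into Latin-1. The names TURKISH_EXTRA and fits_latin1 are made up for this illustration.

```python
# -*- coding: utf-8 -*-
# Turkish letters that are not plain ASCII (hypothetical helper constant).
TURKISH_EXTRA = u"çÇğĞıİöÖşŞüÜ"


def fits_latin1(ch):
    """Return True if ch can be encoded as a single Latin-1 byte."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Split the letters into those Latin-1 can represent and those it cannot.
inside = [ch for ch in TURKISH_EXTRA if fits_latin1(ch)]
outside = [ch for ch in TURKISH_EXTRA if not fits_latin1(ch)]

print(u"in Latin-1:     " + u"".join(inside))
print(u"not in Latin-1: " + u"".join(outside))
```

Letters in the second group (ğ, Ğ, ı, İ, ş, Ş) lie outside the 8-bit range, so a Word built from alphanums+alphas8bit would not be expected to match them.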

When I run the altered code against this input,

sehir=Izmir
ülke=Türkiye
nüfus=4279677
AaBbCcÇçDdEeFfGgGgHhIiIiJjKkLlMmNnOoÖöPpRrSsSsTtUuÜüVvYyZz = 5

where the entire Turkish alphabet appears in the last line, the result is,

sehir is Izmir
ülke is Türkiye
nüfus is 4279677
AaBbCcÇçDdEeFfGgGgHhIiIiJjKkLlMmNnOoÖöPpRrSsSsTtUuÜüVvYyZz is 5

I've found a solution by myself. I don't know whether it is the proper way to achieve this, but it looks fine to me.

from pyparsing import *

alphanums_tr = u'abcçdefgğhiijklmnoöprsştuüvyzABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ0123456789'

key = Word(alphanums_tr)('key')
equals = Suppress('=')
value = Word(alphanums_tr)('value')

kvexpression = key + equals + value

with open('şehir.cfg') as config_in:
  config_data = config_in.read()

for match in kvexpression.scanString(config_data):
    result = match[0]
    print("{0} is {1}".format(result.key, result.value))

The program's output is like this:

şehir is İzmir
ülke is Türkiye
nüfus is 4279677
alfabe is AaBbCcÇçDdEeFfGgĞğHhIiİiJjKkLlMmNnOoÖöPpRrSsŞşTtUuÜüVvYyZz
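One thing the code above leaves implicit is the file's encoding: open() without an encoding argument relies on the platform default. The sketch below, which uses only the standard library, reads the file as UTF-8 explicitly and checks that every character in the data is covered by the character set before it is handed to Word(...). UTF-8 is an assumption about how the file was saved, and read_config_text, unexpected_chars, ASCII_ALNUM and TURKISH_EXTRA are hypothetical names introduced for this example.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# ASCII letters and digits plus the Turkish-specific letters.
ASCII_ALNUM = u"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
TURKISH_EXTRA = u"çÇğĞıİöÖşŞüÜ"
alphanums_tr = ASCII_ALNUM + TURKISH_EXTRA


def read_config_text(path, encoding="utf-8"):
    """Read a whole file as Unicode text, decoding it explicitly."""
    with io.open(path, encoding=encoding) as config_in:
        return config_in.read()


def unexpected_chars(text, allowed):
    """Return the sorted characters in text that are not in allowed,
    ignoring '=' and whitespace."""
    return sorted(set(
        ch for ch in text
        if ch not in allowed and ch != u"=" and not ch.isspace()
    ))


# Write a small UTF-8 sample file so the sketch is self-contained.
sample = u"şehir=İzmir\nülke=Türkiye\nnüfus=4279677\n"
fd, path = tempfile.mkstemp(suffix=".cfg")
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(sample)
    config_data = read_config_text(path)
finally:
    os.remove(path)

# An empty list means every key/value character is covered by alphanums_tr.
print(unexpected_chars(config_data, alphanums_tr))
```

If the list is empty, passing alphanums_tr to Word(...) as in the answer above should be able to match every key and value in the file; any character that shows up in the list would need to be added to the set.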
