Pyparsing: reading Unicode characters from a file

Question

I'd like to read some values from a sample.cfg file and parse them. The code looks like this:

from pyparsing import *

key = Word(alphanums)('key')
equals = Suppress('=')
value = Word(alphanums)('value')

kvexpression = key + equals + value

with open('sample.cfg') as config_in:
  config_data = config_in.read()

for match in kvexpression.scanString(config_data):
  result = match[0]
  print("{0} is {1}".format(result.key, result.value))

If I use ASCII characters, it works fine, like this:

sample.cfg

city=Atlanta
state=Georgia
population=5522942

But if I use some Unicode characters in the input file, it doesn't work as expected.

sample.cfg (with Unicode letters)

şehir=İzmir
ülke=Türkiye
nüfus=4279677

If you run this program, its output is like this:

lke is T
fus is 4279677

As you can see, it ignores the Unicode characters.

Update:

I altered the code as suggested. Now it looks like this:

from pyparsing import *

key = Word(alphanums + alphas8bit)('key')
equals = Suppress('=')
value = Word(alphanums + alphas8bit)('value')

kvexpression = key + equals + value

with open('şehir.cfg') as config_in:
  config_data = config_in.read()

for match in kvexpression.scanString(config_data):
  result = match[0]
  print("{0} is {1}".format(result.key, result.value))

And with minor changes to the data file:

sample.cfg

şehir=İzmir
ülke=Türkiye
nüfus=4279677
alfabe=AaBbCcÇçDdEeFfGgĞğHhIiİiJjKkLlMmNnOoÖöPpRrSsŞşTtUuÜüVvYyZz

When I run the program, its output is like this:

ülke is Türkiye
nüfus is 4279677
alfabe is AaBbCcÇçDdEeFfGg

As you can see, the first line, which starts with the accented s 'ş', is not displayed. I noticed this situation before.

Almost there, but not quite.

I use a Linux box.

Answer 1

Replace alphanums with alphanums+alphas8bit in two places in your code, as in this line.

key = Word(alphanums+alphas8bit)('key')

The problem is that alphanums matches only the unaccented Latin alphabet (plus the numerical digits). alphas8bit matches the additional 8-bit characters in Latin-1.
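One caveat is worth checking: Latin-1 stops at U+00FF, so Turkish letters such as ş, ğ, İ and ı fall outside it. That would explain why, in the question's update, the line starting with ş is still skipped and the alfabe value stops just before Ğ. Here is a minimal sketch using only the standard library to see which letters Latin-1 can represent. The helper name in_latin1 is hypothetical, and encodability in Latin-1 is used as a stand-in for membership in alphas8bit:

```python
def in_latin1(ch):
    """Return True if the character can be encoded in Latin-1 (ISO 8859-1)."""
    try:
        ch.encode('latin-1')
        return True
    except UnicodeEncodeError:
        return False

# The Turkish letters that are not plain ASCII.
turkish_letters = u'çğıöşüÇĞİÖŞÜ'

covered = "".join(ch for ch in turkish_letters if in_latin1(ch))
missing = "".join(ch for ch in turkish_letters if not in_latin1(ch))

print("in Latin-1:     ", covered)
print("outside Latin-1:", missing)
```

Any letter reported as outside Latin-1 has to be added to the character set some other way.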

When I run the altered code against this input,

sehir=Izmir
ülke=Türkiye
nüfus=4279677
AaBbCcÇçDdEeFfGgGgHhIiIiJjKkLlMmNnOoÖöPpRrSsSsTtUuÜüVvYyZz = 5

where the entire Turkish alphabet appears in the last line, the result is,

sehir is Izmir
ülke is Türkiye
nüfus is 4279677
AaBbCcÇçDdEeFfGgGgHhIiIiJjKkLlMmNnOoÖöPpRrSsSsTtUuÜüVvYyZz is 5
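Results like these also depend on the file being decoded to text before it is parsed. A bare open('sample.cfg') on Python 3 falls back to the platform's default encoding. The sketch below assumes the file is saved as UTF-8 (the question does not state the encoding) and uses io.open, which accepts an encoding argument on both Python 2 and 3. The temporary file exists only to make the example self-contained:

```python
import io
import os
import tempfile

sample = u"ülke=Türkiye\nnüfus=4279677\n"

# Write a throwaway UTF-8 file so the example does not need a real sample.cfg.
path = os.path.join(tempfile.mkdtemp(), "sample.cfg")
with io.open(path, "w", encoding="utf-8") as config_out:
    config_out.write(sample)

# Decode explicitly instead of relying on the locale's default encoding.
with io.open(path, encoding="utf-8") as config_in:
    config_data = config_in.read()

# config_data is now text (unicode), ready to hand to scanString().
print(config_data == sample)
```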

Answer 2

I've found a solution by myself. I don't know whether it is the proper way to achieve this, but it looks fine to me.

from pyparsing import *

alphanums_tr = u'abcçdefgğhiijklmnoöprsştuüvyzABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ0123456789'

key = Word(alphanums_tr)('key')
equals = Suppress('=')
value = Word(alphanums_tr)('value')

kvexpression = key + equals + value

with open('şehir.cfg') as config_in:
  config_data = config_in.read()

for match in kvexpression.scanString(config_data):
    result = match[0]
    print("{0} is {1}".format(result.key, result.value))

The program's output is like this:

şehir is İzmir
ülke is Türkiye
nüfus is 4279677
alfabe is AaBbCcÇçDdEeFfGgĞğHhIiİiJjKkLlMmNnOoÖöPpRrSsŞşTtUuÜüVvYyZz
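Typing the alphabet by hand works, but it is easy to miss a letter. As an alternative sketch, the character set could be generated from Unicode character properties with str.isalnum and then passed to Word in the same way as alphanums_tr. The upper bound U+024F (the end of Latin Extended-B) and the name unicode_alphanums are assumptions chosen for illustration, and the code targets Python 3:

```python
# Collect every alphanumeric character from Basic Latin through
# Latin Extended-B (U+0000..U+024F). The range is an assumption; widen it
# if the input uses other scripts. Note that isalnum() also admits a few
# non-letter characters such as superscript digits.
unicode_alphanums = "".join(
    chr(cp) for cp in range(0x0250) if chr(cp).isalnum()
)

# Every word in the sample data is covered, including ş, İ and ü.
for word in ("şehir", "İzmir", "ülke", "Türkiye", "nüfus", "4279677"):
    assert all(ch in unicode_alphanums for ch in word), word

# With pyparsing installed, the set would be used like alphanums_tr:
#   key = Word(unicode_alphanums)('key')
#   value = Word(unicode_alphanums)('value')
print(len(unicode_alphanums), "characters in the set")
```

The '=' separator and whitespace are not alphanumeric, so they stay out of the set and the key/value split is unaffected.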
