Pyparsing reading unicode characters from file

I'd like to read some values from a sample.cfg file and parse them. The code looks like this:

from pyparsing import *

key = Word(alphanums)('key')
equals = Suppress('=')
value = Word(alphanums)('value')

kvexpression = key + equals + value

with open('sample.cfg') as config_in:
  config_data = config_in.read()

for match in kvexpression.scanString(config_data):
  result = match[0]
  print("{0} is {1}".format(result.key, result.value))

With ASCII characters it works fine, for example:

sample.cfg

city=Atlanta
state=Georgia
population=5522942

But if the input file contains Unicode characters, it doesn't work as expected.

sample.cfg (with unicode letters)

şehir=İzmir
ülke=Türkiye
nüfus=4279677

If you run the program, its output is:

lke is T
fus is 4279677

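As you can see, it skips the non-ASCII characters: the first line is missing entirely, and the other keys and values are cut at the accented letters.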

Update:

I altered the code as suggested. It now looks like this:

from pyparsing import *

key = Word(alphanums + alphas8bit)('key')
equals = Suppress('=')
value = Word(alphanums + alphas8bit)('value')

kvexpression = key + equals + value

with open('şehir.cfg') as config_in:
  config_data = config_in.read()

for match in kvexpression.scanString(config_data):
  result = match[0]
  print("{0} is {1}".format(result.key, result.value))

And I made minor changes to the data file:

sample.cfg

şehir=İzmir
ülke=Türkiye
nüfus=4279677
alfabe=AaBbCcÇçDdEeFfGgĞğHhIiİiJjKkLlMmNnOoÖöPpRrSsŞşTtUuÜüVvYyZz

When I run the program, its output is:

ülke is Türkiye
nüfus is 4279677
alfabe is AaBbCcÇçDdEeFfGg

As you can see, the first line, which starts with 'ş' (s with cedilla), is not displayed, and the alfabe value is cut off just before 'Ğ'. I noticed this situation before.

Almost there, yet not quite.

I use a Linux box.

Replace alphanums with alphanums+alphas8bit in two places in your code, as in this line:

key = Word(alphanums+alphas8bit)('key')

The problem is that alphanums matches only the unaccented Latin alphabet (plus the numerical digits). alphas8bit matches the additional 8-bit characters in Latin-1.
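Latin-1 only covers code points up to U+00FF, so it can help to check which of the Turkish letters actually fall inside that range. The following is a small standard-library sketch (the variable names are illustrative, not part of pyparsing) that splits the Turkish-specific letters by code point:

```python
# -*- coding: utf-8 -*-
# Turkish letters that are not part of the plain ASCII alphabet.
turkish_letters = u"çğıİöşüÇĞÖŞÜ"

LATIN1_MAX = 0xFF  # highest code point representable in Latin-1

# Letters that fit in Latin-1 (the range alphas8bit is described as covering).
inside_latin1 = [ch for ch in turkish_letters if ord(ch) <= LATIN1_MAX]
# Letters that lie beyond Latin-1 (they live in Latin Extended-A).
outside_latin1 = [ch for ch in turkish_letters if ord(ch) > LATIN1_MAX]

for ch in outside_latin1:
    print(u"{0} is U+{1:04X}, outside Latin-1".format(ch, ord(ch)))
```

Letters such as ç, ö, and ü land in the first list, while ğ, ı, İ, and ş land in the second, so a character set limited to Latin-1 cannot match them.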

When I run the altered code against this input,

sehir=Izmir
ülke=Türkiye
nüfus=4279677
AaBbCcÇçDdEeFfGgGgHhIiIiJjKkLlMmNnOoÖöPpRrSsSsTtUuÜüVvYyZz = 5

where the last line holds the Turkish alphabet (with ş, ğ, and İ written as plain s, g, and I), the result is,

sehir is Izmir
ülke is Türkiye
nüfus is 4279677
AaBbCcÇçDdEeFfGgGgHhIiIiJjKkLlMmNnOoÖöPpRrSsSsTtUuÜüVvYyZz is 5

I've found a solution myself. I don't know whether it is the proper way to achieve this, but it looks fine to me.

from pyparsing import *

# Turkish letters (including dotless ı and dotted İ) plus the digits
alphanums_tr = u'abcçdefgğhıijklmnoöprsştuüvyzABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ0123456789'

key = Word(alphanums_tr)('key')
equals = Suppress('=')
value = Word(alphanums_tr)('value')

kvexpression = key + equals + value

with open('şehir.cfg') as config_in:
  config_data = config_in.read()

for match in kvexpression.scanString(config_data):
    result = match[0]
    print("{0} is {1}".format(result.key, result.value))

The program's output is:

şehir is İzmir
ülke is Türkiye
nüfus is 4279677
alfabe is AaBbCcÇçDdEeFfGgĞğHhIiİiJjKkLlMmNnOoÖöPpRrSsŞşTtUuÜüVvYyZz
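As a possible variation on the hand-typed alphabet, the character set can also be generated. This is only a sketch under an assumption that is not in the original post: that every letter or digit up to U+017F (Basic Latin, Latin-1 Supplement, and Latin Extended-A) is acceptable in keys and values. The name extended_alphanums is hypothetical, the data is given inline instead of being read from a file, and the code targets Python 3.

```python
from pyparsing import Word, Suppress

# Assumption: accept every letter or digit up to U+017F, which covers
# Basic Latin, Latin-1 Supplement, and Latin Extended-A.
extended_alphanums = "".join(
    chr(cp) for cp in range(0x0180) if chr(cp).isalnum()
)

key = Word(extended_alphanums)("key")
equals = Suppress("=")
value = Word(extended_alphanums)("value")
kvexpression = key + equals + value

# Inline sample data instead of a file, so the sketch is self-contained.
config_data = "şehir=İzmir\nülke=Türkiye\nnüfus=4279677\n"

# scanString yields (tokens, start, end) for every match in the text.
pairs = [
    (tokens.key, tokens.value)
    for tokens, start, end in kvexpression.scanString(config_data)
]

for k, v in pairs:
    print("{0} is {1}".format(k, v))
```

A generated set is broader than the Turkish alphabet alone (it also admits letters such as é or ñ), so the explicit alphanums_tr string remains the better choice when only Turkish input should be accepted.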
