
Pyparsing for unicode letters

I need to use pyparsing with Unicode characters. I tried the simple example from their GitHub repository with a French word containing an accented character (cédille), and it gives an error.

My code

from pyparsing import Word, alphas
greet = Word(alphas) + "," + Word(alphas) + "!"
hello = "Hello, cédille!"
greet.parseString(hello)

and it gives error

pyparsing.ParseException: Expected "!" (at char 8), (line:1, col:9)

Is there a way to solve this problem?

Pyparsing has the pyparsing_unicode module, which defines a number of Unicode character ranges, each with its own definitions for alphas, nums, and so on. Ranges include CJK, Cyrillic, Devanagari, Hebrew, Arabic, and others. The greetingInGreek.py and greetingInKorean.py examples in the examples directory show a couple of these in action.

Your example, using the Latin1 set, will look like:

from pyparsing import Word, pyparsing_unicode as ppu
intl_alphas = ppu.Latin1.alphas
greet = Word(intl_alphas) + "," + Word(intl_alphas) + "!"
hello = "Hello, cédille!"
print(greet.parseString(hello))

Prints:

['Hello', ',', 'cédille', '!']

alphas8bit will probably be kept for legacy support, but new applications should use pyparsing_unicode.Latin1.alphas.

alphas is apparently English / pure ASCII only. The following appears to work:

from pyparsing import Word, alphas, alphas8bit
greet = Word(alphas+alphas8bit) + "," + Word(alphas+alphas8bit) + "!"
hello = "Hello, cédille!"
greet.parseString(hello)

This is Unicode, so there is nothing particularly "8-bit" about the character é; but if the documentation is at least approximately correct, I guess it will still break with slightly more exotic accented characters (anything not available in Latin-1, such as Czech or Polish accented characters, or, at the extreme, Vietnamese).
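To make the Latin-1 limitation concrete, here is a small standard-library check of which accented letters fit in Latin-1 at all. The helper name fits_in_latin1 is made up for this illustration and is not part of pyparsing; the sample letters are é (French), ł (Polish), ř (Czech) and ệ (Vietnamese).

```python
def fits_in_latin1(ch):
    """Return True if the character can be encoded in Latin-1 (ISO 8859-1)."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Escapes are used so the sample letters are unambiguous:
# U+00E9 = é, U+0142 = ł, U+0159 = ř, U+1EC7 = ệ
for ch in "\u00e9\u0142\u0159\u1ec7":
    print(f"U+{ord(ch):04X}", fits_in_latin1(ch))
```

Only the first letter should report True, which is why a Latin-1-based alphabet would presumably still stop at the first Polish, Czech or Vietnamese letter.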

Maybe explore the unicodedata module to get a proper enumeration of "alphabetic" characters, or find a third-party module which exposes this Unicode feature properly.
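A rough sketch of the unicodedata idea, assuming that "alphabetic" means any code point whose general category starts with "L" (Lu, Ll, Lt, Lm, Lo). The function names unicode_letters and make_greeting_parser are invented for this example; I have not checked how well pyparsing copes with such a large character set, so treat the second function as untested.

```python
import sys
import unicodedata


def unicode_letters():
    """Return one string holding every code point in a Unicode letter category."""
    return "".join(
        ch
        for ch in map(chr, range(sys.maxunicode + 1))
        if unicodedata.category(ch).startswith("L")
    )


def make_greeting_parser():
    """Build the greeting grammar from the question using the full letter set."""
    # Imported lazily so the helper above works without pyparsing installed.
    from pyparsing import Word

    letters = unicode_letters()
    return Word(letters) + "," + Word(letters) + "!"
```

With pyparsing installed, make_greeting_parser().parseString("Hello, cédille!") would be the call to try. Scanning the whole code space takes a moment, so the string is worth building once and reusing.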
