
Problems with python regex encoding?

I have a large .txt file made up of four columns (word1, word2, id, number), as follows:

s = '''
Vaya ir VMM03S0 0.427083
mañanita mañana RG 0.796611
, , Fc 1
buscando buscar VMG0000 1
una uno DI0FS0 0.951575
lavadora lavadora NCFS000 0.414738
con con SPS00 1
la el DA0FS0 0.972269
que que PR0CN000 0.562517
sorprender sorprender VMN0000 1
a a SPS00 0.996023
una uno DI0FS0 0.951575
persona persona NCFS000 0.98773
muy muy RG 1
especiales especial AQ0CS0 1
para para SPS00 0.999103
nosotros nosotros PP1MP000 1
, , Fc 1
y y CC 0.999962
la lo PP3FSA00 0.0277039
encontramos encontrar VMIP1P0 0.65
. . Fp 1

Pero pero CC 0.999764
vamos ir VMIP1P0 0.655914
a a SPS00 0.996023
lo el DA0NS0 0.457533
que que PR0CN000 0.562517
interesa interesar VMIP3S0 0.994868
LO_QUE_INTERESA_La lo_que_interesa_la NP00000 1
lavadora lavador AQ0FS0 0.585262
tiene tener VMIP3S0 1
una uno DI0FS0 0.951575
clasificación clasificación NCFS000 1
A+ a+ NP00000 1
, , Fc 1
de de SPS00 0.999984
las el DA0FP0 0.970954
que que PR0CN000 0.562517
ahorran ahorrar VMIP3P0 1
energía energía NCFS000 1
, , Fc 1
si si CS 0.99954
no no RN 0.998134
me me PP1CS000 0.89124
equivoco equivocar VMIP1S0 1
. . Fp 1

Lava lavar VMIP3S0 0.397388
hasta hasta SPS00 0.957698
7 7 Z 1
kg kilogramo NCMN000 1
, , Fc 1
no no RN 0.998134
está estar VAIP3S0 0.999201
nada nada RG 0.135196
mal mal RG 0.497537
, , Fc 1
se se P00CN000 0.465639
le le PP3CSD00 1
veía ver VMII3S0 0.62272
un uno DI0MS0 0.987295
gran gran AQ0CS0 1
tambor tambor NCMS000 1
( ( Fpa 1
de de SPS00 0.999984
acero acero NCMS000 0.973481
inoxidable inoxidable AQ0CS0 1
) ) Fpt 1
y y CC 0.999962
un uno DI0MS0 0.987295
consumo consumo NCMS000 0.948927
máximo máximo AQ0MS0 0.986111
de de SPS00 0.999984
49 49 Z 1
litros litro NCMP000 1
Mandos mandos NP00000 1
intuitivos intuitivo AQ0MP0 1
, , Fc 1
todo todo PI0MS000 0.43165
muy muy RG 1
bien bien RG 0.902728
explicado explicar VMP00SM 1
, , Fc 1
nada nada PI0CS000 0.850279
que que PR0CN000 0.562517
ver ver VMN0000 0.997382
con con SPS00 1
hola RG 0.90937838
como VMP00SM 1
estas AQ089FG 0.90839
la el DA0FS0 0.972269
lavadora lavadora NCFS000 0.414738
de de SPS00 0.999984
casa casa NCFS000 0.979058
de de SPS00 0.999984
mis mi DP1CPS 0.995868
padres padre NCMP000 1
Además además NP00000 1
también también RG 1
seca seco AQ0FS0 0.45723
preciadas preciar VMP00PF 1
. . Fp 1'''

For example, from the "file" s above I would like to extract the ids that start with RG and AQ, together with their word2, but only when they occur one right after the other, on consecutive lines. In the example above, these lines satisfy that order:

muy muy RG 1
especiales especial AQ0CS0 1

These lines, on the other hand, do not satisfy that order (another line sits between the RG line and the AQ line), so I do not want them extracted as a tuple:

hola RG 0.90937838
como VMP00SM 1
estas AQ089FG 0.90839

I would like to create a regex that extracts, as a list of tuples, only the word2 followed by its id, like this: [('word2', 'id')]. It should run over the whole .txt file and pick up every pair of lines that satisfies the consecutive order. For the example above, these are the only valid matches:

muy muy RG 1
especiales especial AQ0CS0 1

and

también también RG 1
seca seco AQ0FS0 0.45723

They should then be returned as tuples with their full ids, since they preserve the consecutive order:

[('muy', 'RG', 'especial', 'AQ0CS0'), ('también', 'RG', 'seco', 'AQ0FS0')]

I tried the following:

in:

t = re.findall(r'(\w+)\s*(RG)[^\n]*\n[^\n]*?(\w+)\s*(AQ\w*)', s)
print t

But my output is wrong, since it drops the accented character and the letters before it:

out:

[('muy', 'RG', 'especial', 'AQ0CS0'), ('n', 'RG', 'seco', 'AQ0FS0')]

instead of the correct result:

[('muy', 'RG', 'especial', 'AQ0CS0'), ('también', 'RG', 'seco', 'AQ0FS0')]

Could someone help me understand what went wrong in the example above, and how to fix it so that I capture the word2 and id of the lines that occur consecutively? Thanks in advance, guys.

It seems that \w+ does not recognize the special character é.

So if the columns in your text are strictly separated by whitespace, you can replace \w with \S.

The regex then becomes:

t = re.findall(r'(\S+)\s*(RG)[^\n]*\n[^\n]*?(\S+)\s*(AQ\S*)', s)
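
A minimal sketch of that \S variant, written for Python 3 (the snippets in this thread are Python 2) and run on a few lines copied from the question. The names sample, PATTERN_S and pairs are made up for this illustration. Keep in mind that \S accepts any non-whitespace character, punctuation included, so this relies on the columns being whitespace-separated.

```python
import re

# A few lines from the question: two valid RG/AQ pairs and one
# RG ... AQ sequence that is interrupted by another line.
sample = (
    "muy muy RG 1\n"
    "especiales especial AQ0CS0 1\n"
    "hola RG 0.90937838\n"
    "como VMP00SM 1\n"
    "estas AQ089FG 0.90839\n"
    "también también RG 1\n"
    "seca seco AQ0FS0 0.45723\n"
)

# Same shape as the original pattern, with \w swapped for \S.
PATTERN_S = re.compile(r'(\S+)\s*(RG)[^\n]*\n[^\n]*?(\S+)\s*(AQ\S*)')

pairs = PATTERN_S.findall(sample)
print(pairs)
# [('muy', 'RG', 'especial', 'AQ0CS0'), ('también', 'RG', 'seco', 'AQ0FS0')]
```

The interrupted hola / como / estas sequence is skipped because [^\n]*\n lets the pattern step over exactly one line break between the RG line and the AQ line.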

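In Python 2, with 8-bit strings (str), \w matches [0-9a-zA-Z_]. In UTF-8, é is stored as two non-ASCII bytes, so \w+ stops in front of it, and the only part of también that can sit directly before RG is the final n. However, if you use unicode strings and compile your pattern with the re.UNICODE flag, then \w matches word characters based on the Unicode database. From the Python documentation, section 7.2.1, Regular Expression Syntax: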

When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.


Thus you can do

import re
u = s.decode('UTF-8')  # or whatever encoding is in your text file
# search the decoded text u; re.UNICODE goes in as the flags argument
t = re.findall(r'(\w+)\s*(RG)[^\n]*\n[^\n]*?(\w+)\s*(AQ\w*)', u, re.UNICODE)
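
The effect of the decode step can be reproduced in Python 3, where a bytes object behaves much like a Python 2 8-bit str as far as \w is concerned. The sketch below is only an illustration under that assumption; raw, byte_hits, text and text_hits are hypothetical names, and é is written as \u00e9 so the example does not depend on how this page is encoded.

```python
import re

# Python 3 bytes stand in for a Python 2 8-bit str here.
raw = 'tambi\u00e9n tambi\u00e9n RG 1\nseca seco AQ0FS0 0.45723\n'.encode('utf-8')

# With a bytes pattern, \w only matches ASCII word characters, so the
# two UTF-8 bytes of é break the word and only the trailing n is captured.
byte_hits = re.findall(rb'(\w+)\s*(RG)[^\n]*\n[^\n]*?(\w+)\s*(AQ\w*)', raw)
print(byte_hits)   # [(b'n', b'RG', b'seco', b'AQ0FS0')]

# After decoding, \w follows the Unicode database and é counts as a letter.
text = raw.decode('utf-8')
text_hits = re.findall(r'(\w+)\s*(RG)[^\n]*\n[^\n]*?(\w+)\s*(AQ\w*)', text)
print(text_hits)   # [('también', 'RG', 'seco', 'AQ0FS0')]
```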

In Python 3 much of the str / unicode confusion is gone; when you open a file in text mode and read its contents, you will get a Python 3 str object that handles everything as Unicode characters.
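
As a rough Python 3 sketch of reading the file in text mode: the helper find_rg_aq_pairs and the temporary file are assumptions made for this example, and the encoding is passed explicitly as UTF-8 because the actual encoding of the asker's file is not stated.

```python
import os
import re
import tempfile

PATTERN = re.compile(r'(\w+)\s*(RG)[^\n]*\n[^\n]*?(\w+)\s*(AQ\w*)')

def find_rg_aq_pairs(path, encoding='utf-8'):
    """Read the whole file as text and return (word2, id, word2, id) tuples."""
    with open(path, encoding=encoding) as handle:
        return PATTERN.findall(handle.read())

# Write a tiny stand-in for the .txt file so the example is self-contained.
sample = (
    'muy muy RG 1\n'
    'especiales especial AQ0CS0 1\n'
    'tambi\u00e9n tambi\u00e9n RG 1\n'
    'seca seco AQ0FS0 0.45723\n'
)
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write(sample)

try:
    pairs = find_rg_aq_pairs(tmp.name)
finally:
    os.remove(tmp.name)

print(pairs)
```

Because the file contents arrive as a str, the original \w pattern can be used unchanged and no re.UNICODE flag is needed.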
