Python regex to find accented words

I need help finding accented words in a Spanish text. I have to search a large text for the first paragraph that starts with the words 'Nombre vernáculo'.
For example, the text looks like: "Nombre vernáculo registrado en la zona de ..."
But accented words are not recognized by my Python script.

I've tried:

re.compile('/(?<!\p{L})(vern[áa]culo*)(?!\p{L})/')
re.compile(r'Nombre vern[a\xc3\xa1]culo\.', re.UNICODE)
re.compile ('[A-Z][a-záéíóúñ]+')
\p{Lu}] [\p{Ll}]+ \b

I've read the following threads:

grep/regex can't find accented word
Python Regex strange behavior with accented characters
Python regex and accented Expression
Python: using regex and tokens with accented chars (negative lookbehind)

I also found something that almost works:

In [95]: dd=re.search(r'^\w.*', 'Nombre vernáculo' )
In [96]: dd.group(0)
Out[96]: 'Nombre vern\xc3\xa1culo'

But it also returns all accented words in the text.

Any help with this will be appreciated. Thanks.

The simplest way to do this is the same way you'd do it in Python 3. This means you have to explicitly use unicode objects instead of str objects, including u-prefixed string literals. Ideally, you should also put an explicit coding declaration at the top of your file so you can write the accented characters directly in the literals.

# -*- coding: utf-8 -*-

import re

pattern = re.compile(ur'Nombre vern[aá]culo')
text = u'Nombre vernáculo'
match = pattern.search(text)
print match

Notice that I left off the \. at the end of the pattern. Your text doesn't end in a period, so you shouldn't be looking for one, or the match is going to fail.

Of course, if you want to search text that comes from somewhere besides your source code, you'll need to decode('utf-8') it, or read the file with io.open or codecs.open instead of plain open, etc.
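
As a rough sketch of the file case, written for Python 3 (where str is already Unicode), the snippet below reads a UTF-8 file with io.open and returns the first line that starts with the phrase. The helper name first_vernacular_paragraph, the sample text, and the temporary file are made up for illustration, and it assumes each paragraph sits on a single line.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

# re.MULTILINE lets ^ and $ match at every line boundary, so the pattern
# matches a whole line that begins with the phrase (accented or not).
PATTERN = re.compile(u'^Nombre vern[aá]culo.*$', re.MULTILINE)


def first_vernacular_paragraph(path):
    """Return the first line starting with 'Nombre vernáculo', or None."""
    # io.open decodes the bytes on disk into Unicode text while reading.
    with io.open(path, encoding='utf-8') as handle:
        text = handle.read()
    match = PATTERN.search(text)
    return match.group(0) if match else None


# Demo: write a small UTF-8 sample to a temporary file, then search it.
sample = (u'Introducción\n'
          u'Nombre vernáculo registrado en la zona de ...\n'
          u'Nombre vernáculo dos\n')
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(demo_path, 'w', encoding='utf-8') as handle:
    handle.write(sample)

result = first_vernacular_paragraph(demo_path)
os.remove(demo_path)
print(result)
```

Because search stops at the first match, later lines that also start with the phrase are ignored.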


If you can't use a coding declaration, or can't trust your text editor to handle UTF-8, you can still use Unicode strings, just escape the characters with their Unicode code points:

import re

pattern = re.compile(ur'Nombre vern[a\xe1]culo')
text = u'Nombre vern\xe1culo'
match = pattern.search(text)
print match

If you have to use str, then you do have to manually encode to UTF-8 and escape the individual bytes, as you were trying to do. But now you're not trying to match a single character; you're matching a two-byte sequence, \xc3\xa1. So you can't use a character class, which only ever matches one byte at a time. Instead, you have to write it out explicitly as a group with alternation:

pattern = re.compile(r'Nombre vern(?:a|\xc3\xa1)culo')
text = 'Nombre vern\xc3\xa1culo'
match = pattern.search(text)
print match
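
To make the difference concrete, here is a small comparison using Python 3 bytes patterns, which behave like Python 2 str patterns for this purpose. The variable names are only illustrative.

```python
import re

# UTF-8 encodes the accented letter as the two bytes 0xC3 0xA1.
text = u'Nombre vern\xe1culo'.encode('utf-8')

# A character class matches exactly ONE byte, so after consuming 0xC3
# the pattern expects "culo" but finds 0xA1 and the match fails.
char_class = re.compile(br'Nombre vern[a\xc3\xa1]culo')

# Alternation can match either the single byte "a" or the two-byte sequence.
alternation = re.compile(br'Nombre vern(?:a|\xc3\xa1)culo')

class_match = char_class.search(text)
alt_match = alternation.search(text)

print(class_match)         # None: the class cannot span two bytes
print(alt_match.group(0))  # the full encoded phrase
```

Both patterns still match the unaccented spelling, since that only needs the single byte for "a".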

import re
r1 = re.compile(r'(Nombre vernáculo)')
x = 'Nombre vernáculo registrado en la zona de'
match = r1.search(x)
print(match.group(1))

With Python 2:

/tmp> python2 test.py
  File "test.py", line 5
SyntaxError: Non-ASCII character '\xc3' in file test.py on line 5, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

With Python 3:

/tmp> python3 test.py 
Nombre vernáculo
