
Regex to match capital/special/Unicode/Vietnamese characters

I work with Vietnamese texts and want to find every word that contains at least one uppercase (capital) letter. When I use the 're' module, my function (temp) does not catch words like "Đà". The other approach (temp2) checks one character at a time. It works, but it is slow because I first have to split the sentences into words.

Hence I would like to know whether the "re" module can catch all the special capital letters.

I have two approaches:

import re

def temp(sentence):
    return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)


lis=word_tokenize(sentence)
def temp2(lis):
    proper_noun=[]
    for word in lis:
        for letter in word:
            if letter.isupper():
                proper_noun.append(word)
                break
    return proper_noun

Input:

'nous avons 2 Đồng et 3 Euro'

Expected output :

['Đồng','Euro']

Thank you!

You may use this regex:

\b\S*[AĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴA-Z]+\S*\b

Regex Demo
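As a minimal sketch of how this regex could be used from Python 3 (the names VIET_UPPER and pattern are only illustrative, and the letter list is written once because repeating characters inside a character class has no effect):

```python
import re

# Vietnamese capitals, listed once; "A-Z" covers the unaccented letters.
VIET_UPPER = (
    "AĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊ"
    "OÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴ"
)

# Same shape as the regex above: a run of non-space characters that
# contains at least one capital, delimited by word boundaries.
pattern = re.compile(r"\b\S*[" + VIET_UPPER + r"A-Z]+\S*\b")

sentence = "nous avons 2 Đồng et 3 Euro"
result = pattern.findall(sentence)
print(result)  # ['Đồng', 'Euro']
```

Because \S* also accepts punctuation, a token such as "l'Euro" would be returned whole. That may or may not be what you want.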

The answer by @Rizwan M.Tuman is correct. I want to share the execution times of the three functions over 100,000 sentences.

lis=word_tokenize(sentence)
def temp(lis):
    proper_noun=[]
    for word in lis:
        for letter in word:
            if letter.isupper():
                proper_noun.append(word)
                break
    return proper_noun

def temp2(sentence):
    return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)

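# capital_letter is not defined in this snippet; presumably it holds the
# extended regex from the answer above
def temp3(sentence):
    return re.findall(capital_letter,sentence)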

I timed each function this way:

start_time = time.time()
for k in range(100000):
    temp2(sentence)
print("%s seconds" % (time.time() - start_time))

Here are the results:

>>Check each character of a list of words if it is a capital letter (.isupper())
(the sentence has already been split into words)
0.4416656494140625 seconds

>>Function with re module which finds normal capital letters [A-Z] :
0.9373950958251953 seconds

>>Function with re module which finds all kinds of capital letters :
1.0783331394195557 seconds
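For anyone who wants to reproduce this kind of comparison, here is a rough sketch using the standard timeit module instead of manual time.time() calls. The function names, the str.split() stand-in for word_tokenize, and the iteration count are my own assumptions, so the numbers will not match the ones above:

```python
import re
import timeit

sentence = "nous avons 2 Đồng et 3 Euro"
words = sentence.split()  # simple stand-in for word_tokenize (assumption)

# Compiled once so the loop measures matching, not compilation.
ASCII_UPPER = re.compile(r"[a-z]*[A-Z]+[a-z]*")

def by_isupper(words):
    # Keep every word that has at least one uppercase character.
    return [w for w in words if any(ch.isupper() for ch in w)]

def by_ascii_regex(sentence):
    # ASCII-only pattern from the question; misses letters such as Đ.
    return ASCII_UPPER.findall(sentence)

number = 10_000  # arbitrary repetition count for this sketch
timings = {
    "isupper": timeit.timeit(lambda: by_isupper(words), number=number),
    "ascii regex": timeit.timeit(lambda: by_ascii_regex(sentence), number=number),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for {number} calls")
```

Note that the .isupper() timing here, as in the results above, excludes the cost of tokenizing the sentence.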

To match only chunks of one or more letters that contain at least one uppercase Unicode letter, you may use:

import re, sys, unicodedata

pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
p = re.compile(r"[^\W\d_]*{Lu}[^\W\d_]*".format(Lu=pLu))

sentence = 'nous avons 2 Đồng et 3 Ęułro.+++++++++++++++Next line'
print(p.findall(sentence))
# => ['Đồng', 'Ęułro', 'Next']

The pLu pattern is a character class of uppercase Unicode letters, built dynamically with str.isupper() (the unicodedata import is not actually needed). Its contents depend on the Python version, so use the latest one to include as many Unicode uppercase letters as possible (see this answer for more details, too). The [^\W\d_] construct matches any Unicode letter. So the pattern matches zero or more Unicode letters, followed by at least one uppercase Unicode letter, followed by zero or more Unicode letters.
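A small illustration of the [^\W\d_] construct on its own, assuming Python 3 and an example string of my own. The text is normalized to NFC first because, as far as I know, combining marks in decomposed text do not count as word characters for re:

```python
import re
import unicodedata

# "Not a non-word char, not a digit, not an underscore" leaves letters only.
letters_only = re.compile(r"[^\W\d_]+")

# NFC keeps each accented letter as a single code point.
text = unicodedata.normalize("NFC", "Đồng_2 Euro3 naïve")

runs = letters_only.findall(text)
print(runs)  # ['Đồng', 'Euro', 'naïve']
```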

Note that your original r'[a-z]*[A-Z]+[a-z]*' will only find Next in this input:

print(re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)) # => ['Next']

See the Python demo

To match the words as whole words, use the \b word boundary:

p = re.compile(r"\b[^\W\d_]*{Lu}[^\W\d_]*\b".format(Lu=pLu))
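To see what the word boundaries change, here is a short comparison on a made-up string in which a capitalized chunk is glued to a digit. The variable names (upper, loose, whole) are only illustrative:

```python
import re
import sys
import unicodedata

# Character class of every code point that str.isupper() reports as uppercase.
upper = "[{}]".format(
    "".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper())
)

loose = re.compile(r"[^\W\d_]*{Lu}[^\W\d_]*".format(Lu=upper))
whole = re.compile(r"\b[^\W\d_]*{Lu}[^\W\d_]*\b".format(Lu=upper))

# NFC so that accented letters are single code points.
text = unicodedata.normalize("NFC", "code2Đồng Euro")

loose_hits = loose.findall(text)
whole_hits = whole.findall(text)
print(loose_hits)  # ['Đồng', 'Euro']
print(whole_hits)  # ['Euro']
```

There is no word boundary between "2" and "Đ" because both are word characters, so the \b version skips the chunk embedded in "code2Đồng".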

If you want to use Python 2.x, do not forget to use the re.U flag to make \W, \d and \b Unicode-aware. However, it is recommended to use the latest PyPI regex library and its [[:upper:]] / \p{Lu} constructs to match uppercase letters, since it supports an up-to-date list of Unicode letters.
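A possible sketch with the third-party regex package, assuming it has been installed from PyPI (pip install regex). The import is guarded so the snippet still runs without it, and \p{L} is used here for "any letter" in place of [^\W\d_]:

```python
import unicodedata

try:
    import regex  # third-party package from PyPI, assumed to be installed
except ImportError:
    regex = None

# NFC so that accented letters are single code points matched by \p{L}.
sentence = unicodedata.normalize(
    "NFC", "nous avons 2 Đồng et 3 Ęułro.+++++++++++++++Next line"
)

if regex is not None:
    # Any letters, one uppercase letter, any letters, as a whole word.
    pattern = regex.compile(r"\b\p{L}*\p{Lu}\p{L}*\b")
    found = pattern.findall(sentence)
    print(found)  # ['Đồng', 'Ęułro', 'Next']
else:
    found = None
    print("regex package not installed")
```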
