I'm working with Vietnamese texts and I want to find every word that contains at least one uppercase (capital) letter. When I use the 're' module, my function (temp) does not catch words like "Đà". The other approach (temp2) checks each character one at a time; it works, but it is slow since I have to split the sentences into words first.
Hence I would like to know whether there is a way to make the 're' module catch all the special capital letters.
I have two approaches:
def temp(sentence):
    return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)

lis = word_tokenize(sentence)

def temp2(lis):
    proper_noun = []
    for word in lis:
        for letter in word:
            if letter.isupper():
                proper_noun.append(word)
                break
    return proper_noun
Input:
'nous avons 2 Đồng et 3 Euro'
Expected output :
['Đồng','Euro']
Thank you!
You may use this regex:
\b\S*[AĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴA-Z]+\S*\b
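As a minimal sketch of how this pattern could be applied with Python 3's re module (the names VIETNAMESE_UPPER and find_capitalized are illustrative and not part of the answer; the class lists each letter once, which matches the same set of characters):

```python
import re

# Uppercase Vietnamese letters, each listed once, plus the ASCII range A-Z.
VIETNAMESE_UPPER = (
    "AĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊ"
    "OÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴA-Z"
)

# Compile once so the pattern is not re-parsed on every call.
pattern = re.compile(r"\b\S*[" + VIETNAMESE_UPPER + r"]+\S*\b")

def find_capitalized(sentence):
    """Return the non-whitespace chunks that contain a listed uppercase letter."""
    return pattern.findall(sentence)

print(find_capitalized('nous avons 2 Đồng et 3 Euro'))
```

Because the pattern uses \S*, a hyphenated token such as "Euro-zone" is returned as one chunk.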
The answer from @Rizwan M.Tuman is correct. I want to share the execution times of the three functions for 100,000 sentences.
lis = word_tokenize(sentence)

def temp(lis):
    proper_noun = []
    for word in lis:
        for letter in word:
            if letter.isupper():
                proper_noun.append(word)
                break
    return proper_noun

def temp2(sentence):
    return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)

# capital_letter is not defined in this post; it presumably holds the regex from the answer above
def temp3(sentence):
    return re.findall(capital_letter, sentence)
Each function was timed this way:
start_time = time.time()
for k in range(100000):
    temp2(sentence)
print("%s seconds" % (time.time() - start_time))
Here are the results:
>>Check each character of a list of words to see if it is a capital letter (.isupper())
(the sentence has already been split into words)
0.4416656494140625 seconds
>>Function with re module which finds normal capital letters [A-Z] :
0.9373950958251953 seconds
>>Function with re module which finds all kinds of capital letters :
1.0783331394195557 seconds
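For anyone who wants to repeat the comparison, here is a rough sketch using the standard timeit module instead of time.time(). It makes several assumptions that are not in the post: the sentence is split on whitespace rather than with word_tokenize, the regex is compiled once outside the timed loop, and the names by_isupper and by_ascii_regex are illustrative. The numbers it prints will differ from those above.

```python
import re
import timeit

sentence = 'nous avons 2 Đồng et 3 Euro'
words = sentence.split()  # assumption: whitespace split stands in for word_tokenize

# Compiled once, so pattern compilation is not part of the measurement.
ASCII_UPPER = re.compile(r'[a-z]*[A-Z]+[a-z]*')

def by_isupper(words):
    """Return the words that contain at least one uppercase character."""
    return [w for w in words if any(ch.isupper() for ch in w)]

def by_ascii_regex(sentence):
    """Return chunks matched by the ASCII-only pattern from the question."""
    return ASCII_UPPER.findall(sentence)

candidates = [
    ('isupper on pre-split words', lambda: by_isupper(words)),
    ('ASCII-only regex', lambda: by_ascii_regex(sentence)),
]
for name, func in candidates:
    seconds = timeit.timeit(func, number=100_000)
    print("{}: {:.3f} seconds".format(name, seconds))
```

Note that the two functions do not return the same thing for this sentence: the ASCII-only regex misses the word starting with Đ, so the timings compare speed only, not correctness.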
To match only chunks of one or more letters that contain at least one uppercase Unicode letter, you may use:
import re, sys, unicodedata
pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
p = re.compile(r"[^\W\d_]*{Lu}[^\W\d_]*".format(Lu=pLu))
sentence = 'nous avons 2 Đồng et 3 Ęułro.+++++++++++++++Next line'
print(p.findall(sentence))
# => ['Đồng', 'Ęułro', 'Next']
The pLu is a Unicode uppercase letter character class pattern built dynamically: it collects every code point for which str.isupper() returns True (the unicodedata import is not actually used by the code as shown). It is dependent on the Python version, so use the latest one to include as many Unicode uppercase letters as possible (see this answer for more details, too). The [^\W\d_] is a construct matching any Unicode letter. So, the pattern matches any 0+ Unicode letters, followed by an uppercase Unicode letter, and then any 0+ Unicode letters.
Note that your original r'[a-z]*[A-Z]+[a-z]*' will only find Next in this input:
print(re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)) # => ['Next']
See the Python demo
To match the words as whole words, use the \b word boundary:
p = re.compile(r"\b[^\W\d_]*{Lu}[^\W\d_]*\b".format(Lu=pLu))
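To illustrate what the word boundaries change, here is a small comparison sketch. The sample string and the names loose and whole are made up for the illustration; pLu is built the same way as in the answer.

```python
import re
import sys

# Same construction as in the answer: every character for which isupper() is True.
pLu = "[{}]".format(
    "".join(chr(i) for i in range(sys.maxunicode) if chr(i).isupper())
)

loose = re.compile(r"[^\W\d_]*{Lu}[^\W\d_]*".format(Lu=pLu))
whole = re.compile(r"\b[^\W\d_]*{Lu}[^\W\d_]*\b".format(Lu=pLu))

sample = "abc1Def x_Y Next"  # hypothetical input with digits and underscores inside tokens

print(loose.findall(sample))  # => ['Def', 'Y', 'Next']
print(whole.findall(sample))  # => ['Next']
```

Without \b, the letter runs inside abc1Def and x_Y are reported on their own. With \b they are skipped, because digits and underscores count as word characters and so there is no word boundary next to them.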
In case you want to use Python 2.x, do not forget to use the re.U flag to make \W, \d and \b Unicode-aware. However, it is recommended to use the latest PyPI regex library and its [[:upper:]] / \p{Lu} constructs to match uppercase letters, since it will support the up-to-date list of Unicode letters.