简体   繁体   English

Python findall 中的多个正则表达式

[英]Multiple regex in Python findall

Say I have a string : "She has an excellent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"假设我有一个字符串:“她对主题有很好的掌握唯一的问题是英语的清晰度她对俄语和 H2O 的信心非常好”

If observed properly, this string doesnt have any punctuation.如果观察得当,这个字符串没有任何标点符号。 I am primarily focusing on putting the periods.我主要专注于放置句点。 "She has an excellent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O" I can use a regex and findall to get a list of relevant words. “她对主题有很好的掌握。唯一的问题是英语的清晰度。她对俄语和 H2O 非常自信”我可以使用正则表达式和 findall 来获取相关单词的列表。 I tried using something like this, but its not giving the desired result.我尝试使用这样的东西,但它没有给出想要的结果。 I would like a computationally efficient code.我想要一个计算效率高的代码。

import re

text = "She has an excelllent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"

r = re.findall('([A-Z][a-z]+)|([a-zA-Z0-9]+)|([A-Z][a-z]+)', text)

I tried something like that with the PCRE engine : (\\p{Ll}+)(\\p{Lu}\\p{Ll}*)我用 PCRE 引擎尝试了类似的东西: (\\p{Ll}+)(\\p{Lu}\\p{Ll}*)

You can test it here: https://regex101.com/r/tqIcdS/1你可以在这里测试: https : //regex101.com/r/tqIcdS/1

The idea is to use the \\p{L} to find any word character (like \\w ) but with handling unicode chars that might have accents (ex: " Le pain, je l'ai mangéEnsuite j'ai bu un verre de vin ").这个想法是使用\\p{L}来查找任何单词字符(如\\w ),但要处理可能带有重音的 unicode 字符(例如:“ Le pain, je l'ai mangéEnsuite j'ai bu un verre de vin ”)。

  • \\p{Ll} matches a lowercase unicode word character. \\p{Ll}匹配一个小写的 unicode 单词字符。

  • \\p{Lu} matches an uppercase unicode word character. \\p{Lu}匹配一个大写的 unicode 单词字符。

I also captured the characters before and after to match the whole word.我还捕获了前后的字符以匹配整个单词。

Unfortunately, Python 's default re library doesn't support it.不幸的是, Python的默认re库不支持它。

But thanks to Wiktor's comment below, you could use the PyPi regex library: https://pypi.org/project/regex/但多亏了 Wiktor 在下面的评论,您可以使用PyPi 正则表达式库: https : //pypi.org/project/regex/

You can use built-in Python re for both ASCII and fully Unicode-aware solutions:您可以将内置 Python re用于 ASCII 和完全识别 Unicode 的解决方案:

import re, sys

pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
pLl = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).islower()]))

text = "She has an excelllent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"
print( re.sub(fr'({pLl})({pLu})', r'\1. \2', text) ) # Unicode-aware
# => She has an excelllent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O
print( re.sub(fr'([a-z])([A-Z])', r'\1. \2', text) ) # ASCII only
# => She has an excelllent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O

See the Python demo .请参阅Python 演示

The main idea is to match and capture a lowercase letter and then an uppercase letter ( ([az])([AZ]) ) and replace with Group 1 value + .主要思想是匹配并捕获一个小写字母,然后是一个大写字母 ( ([az])([AZ]) ) 并替换为 Group 1 value + . and space and then Group 2 value, where \\1 and \\2 are backreferences to these group values.和空格,然后是组 2 值,其中\\1\\2是对这些组值的反向引用。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM