Multiple regular expressions in Python findall

Question

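Suppose I have a string: "She has an excelllent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"

If you look closely, this string has no punctuation at all. I am mainly interested in inserting the periods: "She has an excelllent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O". Can I use a regular expression with findall to get a list of the relevant words? I tried something like the code below, but it does not give the desired result. I would like computationally efficient code.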

import re

text = "She has an excelllent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"

r = re.findall('([A-Z][a-z]+)|([a-zA-Z0-9]+)|([A-Z][a-z]+)', text)

Answer 1

I tried something similar with the PCRE engine: (\p{Ll}+)(\p{Lu}\p{Ll}*)

You can test it here: https://regex101.com/r/tqIcdS/1

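The idea is to use \p{L} to find any letter (much like \w, although \w also matches digits and the underscore) while still handling Unicode characters that may carry accents (for example: "Le pain, je l'ai mangéEnsuite j'ai bu un verre de vin").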

\p{Ll} matches a lowercase Unicode letter.
\p{Lu} matches an uppercase Unicode letter.
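To make the two categories concrete, here is a small sketch using the standard library's unicodedata module, which reports the Unicode general category that \p{Ll} and \p{Lu} refer to. The sample characters are chosen only for illustration.

```python
import unicodedata

# Look up the Unicode general category of a few sample characters.
# "Ll" = lowercase letter, "Lu" = uppercase letter.
samples = ["é", "E", "2", " "]
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
```

The accented "é" is reported as Ll just like a plain ASCII lowercase letter, which is why the pattern also works on the French example.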

I also capture the characters before and after the boundary so that the whole words are matched.

Unfortunately, Python's default re library does not support \p{...} property classes.

But thanks to Wiktor's comment below, you can use the PyPI regex library: https://pypi.org/project/regex/
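A minimal sketch of how that could look, assuming the third-party regex package is installed (pip install regex). The helper name add_periods is hypothetical, and the pure-Python fallback is my own addition so that the snippet still runs without the package; note that str.islower() and str.isupper() are close to, but not exactly the same as, \p{Ll} and \p{Lu}.

```python
try:
    import regex  # third-party package from PyPI, not the built-in re
except ImportError:
    regex = None


def add_periods(text):
    """Insert '. ' wherever a lowercase letter is directly followed by an uppercase one."""
    if regex is not None:
        # \p{Ll} = lowercase Unicode letter, \p{Lu} = uppercase Unicode letter
        return regex.sub(r'(\p{Ll})(\p{Lu})', r'\1. \2', text)
    # Fallback (assumption): approximate the same rule with str methods.
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        nxt = text[i + 1:i + 2]  # empty string at the end of the text
        if ch.islower() and nxt.isupper():
            out.append('. ')
    return ''.join(out)


print(add_periods("Le pain, je l'ai mangéEnsuite j'ai bu un verre de vin"))
```

Here only the boundary pair is captured rather than the whole words, since the replacement needs just the two letters on either side of the missing period.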

Answer 2

You can use the built-in Python re module for both an ASCII-only and a fully Unicode-aware solution:

import re, sys

pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
pLl = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).islower()]))

text = "She has an excelllent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"
print( re.sub(fr'({pLl})({pLu})', r'\1. \2', text) ) # Unicode-aware
# => She has an excelllent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O
print( re.sub(fr'([a-z])([A-Z])', r'\1. \2', text) ) # ASCII only
# => She has an excelllent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O


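The main idea is to match and capture a lowercase letter followed by an uppercase letter ( ([a-z])([A-Z]) ) and to replace the match with the Group 1 value, then a period and a space, then the Group 2 value, where \1 and \2 are backreferences to those group values.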
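If a list of sentences is wanted rather than a single string, the same boundary can be expressed with lookarounds and used for splitting. This is a variation on the approach above, not part of the original answer; it is ASCII-only and assumes Python 3.7 or later, where re.split accepts patterns that match the empty string. The name BOUNDARY is just illustrative.

```python
import re

# Zero-width boundary: a lowercase ASCII letter behind, an uppercase one ahead.
# Compiling once avoids re-parsing the pattern when it is reused on many strings.
BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])')

text = ("She has an excelllent command on the topicsOnly problem is clarity "
        "in EnglishHer confidence is very good in RUSSian and H2O")

sentences = BOUNDARY.split(text)   # list of the three sentences
joined = ". ".join(sentences)      # same result as the re.sub version

print(sentences)
print(joined)
```

Because the lookarounds consume no characters, nothing is lost at the split points, and "RUSSian" and "H2O" are left intact since neither contains a lowercase letter directly followed by an uppercase one.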

Answer 1
1 2021-07-02 09:36:06

Answer 2
0 2021-07-02 10:48:55
