How to find a specific, pre-defined word surrounded by any word(s) starting with a capital letter(s)?
I have been analyzing large amounts of text data. This is what I have so far:
(([A-Z][\w-]*)+\s+(\b(Study|Test)\b)(\s[A-Z][\w-]*)*)|(\b(Study|Test)\b)(\s[A-Z][\w-]*)+
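Types of phrases I would like to capture:
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
I want to capture the word 'Study' or 'Test' ONLY if it is surrounded by words starting with a capital letter. The ideal regex would achieve all of this, and it would also ignore/escape certain words like 'of' or 'the'.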
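You could instead use 2 capture groups and match a single word starting with an uppercase A-Z on either the left or the right of the keyword. Using [^\S\r\n] matches a whitespace character that is not a newline, because \s can also match a newline: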
\b[A-Z]\w*[^\S\r\n]+(Test|Study)\b|\b(Test|Study)[^\S\r\n]+[A-Z]\w*
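To make the two-capture-group idea concrete, here is a minimal sketch using Python's standard `re` module. The helper name `find_keyword_phrases` and the sample text are assumptions made for illustration; only the pattern itself comes from the answer above.

```python
import re

# The pattern from the answer above, split over two lines for readability.
# [^\S\r\n] means "whitespace, but not CR or LF", so a match never crosses a line.
PATTERN = re.compile(
    r"\b[A-Z]\w*[^\S\r\n]+(Test|Study)\b"    # capitalized word on the left
    r"|\b(Test|Study)[^\S\r\n]+[A-Z]\w*"     # capitalized word on the right
)


def find_keyword_phrases(text):
    """Return (phrase, keyword) pairs; hypothetical helper for illustration."""
    results = []
    for m in PATTERN.finditer(text):
        # Exactly one of the two groups participates in any given match.
        keyword = m.group(1) or m.group(2)
        results.append((m.group(0), keyword))
    return results


sample = "Longitudinal Study\nStudy Initiative\na study of tests\n"
for phrase, keyword in find_keyword_phrases(sample):
    print(repr(phrase), keyword)
```

Note that this pattern matches only one neighbouring word, so longer phrases are captured only partially.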
OK, this might go beyond the actual scope, but you could use the newer regex module with subroutines:
(?(DEFINE)
(?<marker>\b[A-Z][-\w]*\b)
(?<ws>[\ \t]+)
(?<needle>\b(?:Study|Test))
(?<pre>(?:(?&marker)(?&ws))+)
(?<post>(?:(?&ws)(?&marker))+)
(?<before>(?&pre)(?&needle))
(?<after>(?&needle)(?&post))
(?<both>(?&pre)(?&needle)(?&post))
)
(?&both)|(?&before)|(?&after)
See a demo on regex101.com (and mind the modifiers).
In actual code, this could be:
import regex as re
junk = """
I have been analyzing large amounts of text data. This is what I got so far:
(([A-Z][\w-]*)+\s+(\b(Study|Test)\b)(\s[A-Z][\w-]*)*)|(\b(Study|Test)\b)(\s[A-Z][\w-]*)+
Types of phrases I would like to capture:
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
I want to capture the word 'Study' or 'Test' ONLY if it is surrounded by the words starting with a capital letter. The ideal regex would achieve all of this + it would ignore\escape certain words like 'of' or 'the'.
*the above regex is super slow with the str.findall function, I guess there must be a better solution
** I used https://regex101.com for testing and then run it in Jupyter, Python 3
"""
pattern = re.compile(r'''
(?(DEFINE)
(?<marker>\b[A-Z][-\w]*\b)
(?<ws>[\ \t]+)
(?<needle>\b(?:Study|Test))
(?<pre>(?:(?&marker)(?&ws))+)
(?<post>(?:(?&ws)(?&marker))+)
(?<before>(?&pre)(?&needle))
(?<after>(?&needle)(?&post))
(?<both>(?&pre)(?&needle)(?&post))
)
(?&both)|(?&before)|(?&after)''', re.VERBOSE)
for match in pattern.finditer(junk):
    print(match.group(0))
And this would yield:
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
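The question also asks for connecting words such as 'of' or 'the' to be tolerated inside a phrase. One possible way to do that with the standard `re` module is sketched below. The connector list, the constant names and the helper `find_phrases` are assumptions made for illustration, and the sketch only allows connectors between two words of a phrase, never at its edges.

```python
import re

WS = r"[^\S\r\n]+"        # horizontal whitespace only
CAP = r"[A-Z][\w-]*"      # a word starting with a capital letter
# Separator between two words of a phrase: whitespace, optionally with
# lowercase connectors in between (the list "of|the" is an assumption).
SEP = rf"(?:{WS}(?:of|the)\b)*{WS}"
NEEDLE = r"\b(?:Study|Test)\b"

PHRASE = re.compile(
    rf"\b(?:{CAP}{SEP})+{NEEDLE}(?:{SEP}{CAP})*"   # capitalized word(s) before, optionally after
    rf"|{NEEDLE}(?:{SEP}{CAP})+"                   # keyword first, capitalized word(s) after
)


def find_phrases(text):
    """Return all matched phrases; hypothetical helper for illustration."""
    return [m.group(0) for m in PHRASE.finditer(text)]


text = "we compared the National Study of Health and a Test of the English Language"
for phrase in find_phrases(text):
    print(phrase)
```

Because a connector is only accepted between two words of the phrase, a lowercase "the Study was" on its own does not match.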
((?:[A-Z]\w+\s+){0,5}\bStudy\b\s*(?:[A-Z]\w+\b\s*){0,5})
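I still have to test it further to check whether it works for all possible real-world scenarios. I may also need to adjust the "5" in the expression to a lower or higher number to tune my algorithm's performance. I have already tested it on some real datasets, and the results so far are promising. It is fast.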
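Since the "5" may need tuning, here is a minimal sketch that builds the same kind of expression from a parameter. The names `build_pattern` and `find_phrases` and the `max_words` argument are assumptions made for illustration. Two properties of the expression above are worth keeping in mind: both quantifiers may match zero times, so a bare keyword also matches, and \s can cross line breaks. The sketch filters out bare keywords and strips trailing whitespace, which is an addition of mine and not part of the original expression.

```python
import re


def build_pattern(max_words=5, keywords=("Study", "Test")):
    """Build the bounded-quantifier pattern with a configurable word limit."""
    needle = "|".join(re.escape(k) for k in keywords)
    cap = r"[A-Z]\w+"
    return re.compile(
        rf"(?:{cap}\s+){{0,{max_words}}}"      # up to max_words capitalized words before
        rf"\b(?:{needle})\b\s*"                # the keyword itself
        rf"(?:{cap}\b\s*){{0,{max_words}}}"    # up to max_words capitalized words after
    )


def find_phrases(text, max_words=5, keywords=("Study", "Test")):
    """Return stripped matches, skipping those that are only the bare keyword."""
    pattern = build_pattern(max_words, keywords)
    phrases = (m.group(0).strip() for m in pattern.finditer(text))
    return [p for p in phrases if p not in keywords]


text = ("the Europe National Longitudinal Study began; we study mice; "
        "a Study Initiative followed; one Study was dropped")
print(find_phrases(text))
print(find_phrases(text, max_words=1))
```

Lowering `max_words` shortens the captured context on each side, which is one knob to try when measuring performance on real data.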