Regex for matching only capitalized words stuck together (i.e. not separated by whitespace)
I have a long list of strings made up of random words, all of them capitalized, such as 'Pomegranate' and 'Yellow Banana'. However, some of them are stuck together, like so: 'AppleOrange'. There are no special characters or digits.
What I need is a regular expression in Python that matches 'Apple' and 'Orange' separately, but not 'Pomegranate' or 'Yellow'.
As you might expect, I'm very new to this, and I've only managed to write r"(?<!\s)([A-Z][a-z]*)", but that still matches 'Yellow' and 'Pomegranate'. How do I do this?
This works:
import re
from collections import deque

# Split at capital letters; the capturing group keeps each delimiter in the result
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
chunks = deque(re.split(pattern, 'AppleOrange'))
result = []
while len(chunks):
    buf = chunks.popleft()
    if len(buf) == 0:
        continue
    # A lone capital letter starts a word: glue the following chunk onto it
    if re.match(r'^[A-Z]$', buf) and len(chunks):
        buf += chunks.popleft()
    result.append(buf)
print(result)
Output:
['Apple', 'Orange']
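To see why the loop has to glue pieces back together, it may help to look at what re.split returns when the pattern contains a capturing group: the matched capital letters are kept as separate list items, and an empty string appears before the first match. A minimal sketch, reusing the same pattern and input as above:

```python
import re

# Same pattern as in the answer above; the outer parentheses form a capturing
# group, so re.split keeps the matched capitals in the returned list.
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'

chunks = re.split(pattern, 'AppleOrange')
print(chunks)  # ['', 'A', 'pple', 'O', 'range']
```

Each lone capital ('A', 'O') is followed by the rest of its word, which is why the loop joins a single-capital chunk with the chunk after it and skips the empty string.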
If they all start with an uppercase char followed by optional lowercase chars, you can use lookarounds and an alternation to match both variations:
(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])
The pattern matches:
(?<=[a-z]) Assert a-z to the left
[A-Z][a-z]* Match A-Z and optional chars a-z
| Or
[A-Z][a-z]* Match A-Z and optional chars a-z
(?=[A-Z]) Assert A-Z to the right
Example:
import re
pattern = r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])"
s = ("AppleOrange\nPomegranate Yellow Banana")
print(re.findall(pattern, s))
Output:
['Apple', 'Orange']
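Since the question mentions a long list of strings rather than one multi-line string, the same pattern could be applied item by item. The following is only a sketch: the helper name split_stuck_words and the sample list are made up for illustration and are not part of the original answer.

```python
import re

# Same lookaround pattern as above, compiled once for reuse.
PATTERN = re.compile(r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])")

def split_stuck_words(strings):
    """Hypothetical helper: collect the words found inside stuck-together strings."""
    found = []
    for s in strings:
        # Strings such as 'Pomegranate' or 'Yellow Banana' yield no matches.
        found.extend(PATTERN.findall(s))
    return found

# Assumed sample data, taken from the examples in the question.
words = split_stuck_words(["Pomegranate", "Yellow Banana", "AppleOrange"])
print(words)  # ['Apple', 'Orange']
```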
Another option is to match what you don't want in order to get it out of the way, use a capture group for what you want to keep, and then remove the empty entries from the result:
(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)
import re
pattern = r"(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)"
s = ("AppleOrange\nPomegranate Yellow Banana")
print([x for x in re.findall(pattern, s) if x])
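The filtering step is needed because re.findall returns the contents of the capture group for every match, and the group is empty whenever the first alternative (a standalone word) is the one that matched. A small sketch using the same pattern and input shows the raw list before and after filtering; the variable names raw and kept are only illustrative:

```python
import re

pattern = r"(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)"
s = "AppleOrange\nPomegranate Yellow Banana"

# With one capture group, findall returns the group's text for each match.
# Standalone words match the first alternative, so their group is ''.
raw = re.findall(pattern, s)
print(raw)   # ['Apple', 'Orange', '', '', '']

# Dropping the empty strings leaves only the stuck-together words.
kept = [x for x in raw if x]
print(kept)  # ['Apple', 'Orange']
```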