Regex to match only capitalized words that are stuck together (i.e. not separated by whitespace)

Question

I have a long list of strings which are all random words, all of them capitalized, such as 'Pomegranate' and 'Yellow Banana'. However, some of them are stuck together, like so: 'AppleOrange'. There are no special characters or digits.

What I need is a regular expression in Python that matches 'Apple' and 'Orange' separately, but not 'Pomegranate' or 'Yellow'.

As expected, I'm very new to this, and I've only managed to write r"(?<!\s)([A-Z][a-z]*)", but that still matches 'Yellow' and 'Pomegranate'. How do I do this?

Answer 1

This works:

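import re
from collections import deque

# Split on capitals: either a run of capitals (acronym) or a single capital
# that starts a word. The capture group keeps each matched capital in the output.
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
chunks = deque(re.split(pattern, 'AppleOrange'))

result = []
while chunks:
    buf = chunks.popleft()
    if not buf:
        # re.split can yield empty strings (e.g. before a match at the start)
        continue
    if re.match(r'^[A-Z]$', buf) and chunks:
        # A lone capital is the first letter of a word: glue the rest back on
        buf += chunks.popleft()
    result.append(buf)

print(result)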

Output:

['Apple', 'Orange']

Check the OP here
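Note that the loop above splits any string it is given, so on its own it would also return 'Pomegranate' or 'Yellow'. A minimal sketch of one way to apply it to the whole list, assuming that only whitespace-free tokens that split into more than one word should be kept. The helper names `split_capitalized` and `stuck_words` are hypothetical, not part of the original answer:

```python
import re
from collections import deque

PATTERN = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'

def split_capitalized(token):
    """Split one whitespace-free token into words, using the same loop as above."""
    chunks = deque(re.split(PATTERN, token))
    words = []
    while chunks:
        buf = chunks.popleft()
        if not buf:
            continue
        if re.match(r'^[A-Z]$', buf) and chunks:
            buf += chunks.popleft()
        words.append(buf)
    return words

def stuck_words(strings):
    """Keep only words that came from a token containing several words (assumption)."""
    result = []
    for s in strings:
        for token in s.split():
            words = split_capitalized(token)
            if len(words) > 1:
                result.extend(words)
    return result

print(stuck_words(['Pomegranate', 'Yellow Banana', 'AppleOrange']))
# ['Apple', 'Orange']
```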

Answer 2

If they all start with an uppercase char followed by optional lowercase chars, you can use lookarounds and an alternation to match both variations:

(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])

The pattern matches:

(?<=[a-z]) Assert a-z to the left
[A-Z][a-z]* Match A-Z and optional chars a-z
| Or
[A-Z][a-z]* Match A-Z and optional chars a-z
(?=[A-Z]) Assert A-Z to the right

Regex demo

Example

import re

pattern = r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])"
s = ("AppleOrange\nPomegranate Yellow Banana")

print(re.findall(pattern, s))

Output

['Apple', 'Orange']
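Since the question describes a list of strings rather than one block of text, here is a minimal sketch applying the same pattern to each string in turn. The helper name `find_stuck` and the sample 'KiwiMangoLime' are illustrative assumptions, added to show that three or more stuck words are also handled:

```python
import re

# Same pattern as above, compiled once for reuse
STUCK = re.compile(r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])")

def find_stuck(strings):
    """Collect every stuck-together word from each string in the list."""
    return [word for s in strings for word in STUCK.findall(s)]

words = ["Pomegranate", "Yellow Banana", "AppleOrange", "KiwiMangoLime"]
print(find_stuck(words))
# ['Apple', 'Orange', 'Kiwi', 'Mango', 'Lime']
```

The first word of a group is caught by the second alternative (a capital follows it), and every later word by the first alternative (a lowercase letter precedes it).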

Another option is to match what you don't want, to get it out of the way, use a capture group for what you want to keep, and remove the empty entries from the result:

(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)

Regex demo | Python demo

import re

pattern = r"(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)"
s = ("AppleOrange\nPomegranate Yellow Banana")

print([x for x in re.findall(pattern, s) if x])
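To see why the empty entries need filtering, it may help to look at what re.findall returns before the list comprehension. A small sketch using the same pattern and input (the variable names `raw` and `kept` are only for illustration):

```python
import re

pattern = r"(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)"
s = "AppleOrange\nPomegranate Yellow Banana"

# With one capture group, findall returns the group's text for every match.
# Stand-alone words match the first alternative, so the group is empty for them.
raw = re.findall(pattern, s)
print(raw)   # ['Apple', 'Orange', '', '', '']

kept = [x for x in raw if x]
print(kept)  # ['Apple', 'Orange']
```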

Answer 1
1 2022-01-31 22:46:50

Answer 2
0 Accepted 2022-01-31 22:40:51