Regex for matching only capitalized words stuck together (i.e. not separated by whitespace)

I have a long list of strings, each made up of random capitalized words, such as 'Pomegranate' and 'Yellow Banana'. However, some of the words are stuck together, like so: 'AppleOrange'. There are no special characters or digits.

What I need is a regular expression in Python that matches 'Apple' and 'Orange' separately, but not 'Pomegranate' or 'Yellow'.

As expected, I'm very new to this, and I've only managed to write r"(?<!\s)([A-Z][a-z]*)"... But that still matches 'Yellow' and 'Pomegranate'. How do I do this?

This works:

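import re
from collections import deque

# The capturing group makes re.split keep the matched uppercase letters
# in the result: ['', 'A', 'pple', 'O', 'range'].
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
chunks = deque(re.split(pattern, 'AppleOrange'))

result = []
while chunks:
    buf = chunks.popleft()
    if not buf:
        # Skip the empty strings that re.split produces.
        continue
    if re.match(r'^[A-Z]$', buf) and chunks:
        # A lone uppercase letter starts a word: glue the next chunk onto it.
        buf += chunks.popleft()
    result.append(buf)

print(result)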

Output:

['Apple', 'Orange']

Check the OP here

If they all start with an uppercase character followed by optional lowercase characters, you can use lookarounds and an alternation to match both variations:

(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])

The pattern matches:

  • (?<=[a-z]) Assert a-z to the left
  • [A-Z][a-z]* Match A-Z and optional chars a-z
  • | Or
  • [A-Z][a-z]* Match A-Z and optional chars a-z
  • (?=[A-Z]) Assert A-Z to the right
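
To see why both alternatives are needed, here is a small sketch that runs each half of the pattern on its own. The sample string 'AppleOrangePear' is an assumption made for illustration and is not from the question.

```python
import re

# Hypothetical sample: three capitalized words stuck together.
sample = "AppleOrangePear"

# First alternative only: a capitalized word preceded by a lowercase letter.
after_lower = re.findall(r"(?<=[a-z])[A-Z][a-z]*", sample)

# Second alternative only: a capitalized word followed by an uppercase letter.
before_upper = re.findall(r"[A-Z][a-z]*(?=[A-Z])", sample)

# Both alternatives combined, as in the full pattern.
combined = re.findall(r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])", sample)

print(after_lower)   # ['Orange', 'Pear']
print(before_upper)  # ['Apple', 'Orange']
print(combined)      # ['Apple', 'Orange', 'Pear']
```

The first half misses the leading word and the second half misses the trailing one, so only the alternation catches every word in a stuck-together run.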

Regex demo

Example

import re

pattern = r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])"
s = ("AppleOrange\nPomegranate Yellow Banana")

print(re.findall(pattern, s))

Output

['Apple', 'Orange']
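
Since the question mentions a long list of strings, the same pattern could be applied per string. This is only a sketch: the helper name split_stuck_words and the sample list are made up for illustration.

```python
import re

# Same pattern as above, compiled once for reuse.
STUCK = re.compile(r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])")

def split_stuck_words(strings):
    """Map each string that contains stuck-together words to its parts."""
    found = {}
    for text in strings:
        parts = STUCK.findall(text)
        if parts:
            found[text] = parts
    return found

# Hypothetical input list built from the examples in the question.
words = ["Pomegranate", "Yellow Banana", "AppleOrange"]
print(split_stuck_words(words))  # {'AppleOrange': ['Apple', 'Orange']}
```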

Another option is to get what you don't want out of the way by matching it, use a capture group for what you want to keep, and remove the empty entries from the result:

(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)

Regex demo | Python demo

import re

pattern = r"(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)"
s = ("AppleOrange\nPomegranate Yellow Banana")

print([x for x in re.findall(pattern, s) if x])
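
To show where the empty entries come from, this sketch prints the raw re.findall result before filtering. When the first alternative matches a standalone word, the capture group does not take part in the match, so re.findall reports an empty string for it. The names raw and kept are only for illustration.

```python
import re

pattern = r"(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)"
s = "AppleOrange\nPomegranate Yellow Banana"

# With exactly one capture group, findall returns that group's text per match.
raw = re.findall(pattern, s)
print(raw)   # ['Apple', 'Orange', '', '', '']

# Dropping the empty strings leaves only the stuck-together words.
kept = [x for x in raw if x]
print(kept)  # ['Apple', 'Orange']
```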
