
regex two group matches everything until pattern

I have the following examples:

Tortillas Bolsa 2a 1kg 4118
Tortillinas 50p 1 31Kg TAB TR 46113
Bollos BK 4in 36p 1635g SL 131
Super Pan Bco Ajonjoli 680g SP WON 100  
Pan Blanco Bimbo Rendidor 567g BIM 49973
Gansito ME 5p 250g MTA MLA 49860

I want to keep everything before the first number, but I also want to drop the two-letter uppercase words (for example, ME and BK). I'm using ^((\D*).*?) [^A-Z]{2,3}

The expected result should be:

Tortillas Bolsa
Tortillinas
Bollos
Super Pan Bco Ajonjoli
Pan Blanco Bimbo Rendidor
Gansito

With the regex I'm using, I'm still getting the two-letter uppercase words, as in Bollos BK and Gansito ME.

Pre-compile a regex pattern with a lookahead (explained below) and call its match method inside a list comprehension:

>>> import re
>>> p = re.compile(r'\D+?(?=\s*([A-Z]{2})?\s*\d)')
>>> [p.match(x).group() for x in data]

[
 'Tortillas Bolsa',
 'Tortillinas',
 'Bollos',
 'Super Pan Bco Ajonjoli',
 'Pan Blanco Bimbo Rendidor',
 'Gansito'
]

Here, data is your list of strings.

Details

\D+?            # anything that isn't a digit (non-greedy)
(?=             # regex-lookahead
\s*             # zero or more wsp chars
([A-Z]{2})?     # two optional uppercase letters
\s*   
\d              # digit
)
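
As a small sketch, the annotated breakdown above can be compiled almost verbatim with the re.VERBOSE flag, which ignores whitespace and # comments inside the pattern. The variable names here are illustrative, and the optional group is written as non-capturing, which does not change what match().group() returns.

```python
import re

data = [
    "Tortillas Bolsa 2a 1kg 4118",
    "Tortillinas 50p 1 31Kg TAB TR 46113",
    "Bollos BK 4in 36p 1635g SL 131",
    "Super Pan Bco Ajonjoli 680g SP WON 100",
    "Pan Blanco Bimbo Rendidor 567g BIM 49973",
    "Gansito ME 5p 250g MTA MLA 49860",
]

# Same pattern as above, spelled out with inline comments.
pattern = re.compile(r"""
    \D+?               # one or more non-digits, as few as possible
    (?=                # lookahead: must be followed by ...
        \s*            #   optional whitespace
        (?:[A-Z]{2})?  #   an optional pair of uppercase letters
        \s*            #   optional whitespace
        \d             #   a digit
    )
""", re.VERBOSE)

# Every sample string contains a digit, so match() never returns None here.
results = [pattern.match(s).group() for s in data]
print(results)
```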

If any string does not contain the pattern you're looking for, the list comprehension will raise an AttributeError, since p.match returns None in that case. You can instead use a loop and test the return value of p.match before extracting the matched portion:

matches = []
for x in data:
    m = p.match(x)
    if m:
        matches.append(m.group())

Or, if you want a placeholder None when there's no match:

matches = []
for x in data:
    m = p.match(x)  # returns None when the pattern does not match
    matches.append(m.group() if m else None)
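
One possible way to keep the list comprehension while still handling missing matches is to wrap the check in a small helper. This is only a sketch: extract_name is a hypothetical name, and "Pan Dulce" is a made-up input with no digit, added to show the None case.

```python
import re

p = re.compile(r'\D+?(?=\s*([A-Z]{2})?\s*\d)')

def extract_name(text):
    """Return the leading product name, or None if the pattern does not match."""
    m = p.match(text)
    return m.group() if m else None

# The second string is hypothetical: it has no digit, so nothing matches.
samples = ["Gansito ME 5p 250g MTA MLA 49860", "Pan Dulce"]
names = [extract_name(s) for s in samples]
print(names)
```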

You may use the lookahead feature:

import re

I_WANT        = r'(.+?)' # This is what you want
I_DO_NOT_WANT = r'\s(?:[0-9]|(?:[A-Z]{2,3}\s))' # Stop-patterns (raw string, so \s is not treated as a string escape)
RE = '{}(?={})'.format(I_WANT, I_DO_NOT_WANT) # Combine the parts

[re.findall(RE, x)[0] for x in test_strings]
#['Tortillas Bolsa', 'Tortillinas', 'Bollos', 'Super Pan Bco Ajonjoli',
# 'Pan Blanco Bimbo Rendidor', 'Gansito']
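
Note that re.findall(...)[0] raises an IndexError when a string contains no stop-pattern. A hedged alternative, assuming you would rather get a default value back, is to use search and check the result. The helper name product_name and the input "Pan Dulce" are illustrative only.

```python
import re

I_WANT = r'(.+?)'                                # what to keep
I_DO_NOT_WANT = r'\s(?:[0-9]|(?:[A-Z]{2,3}\s))'  # stop-patterns
RE = re.compile('{}(?={})'.format(I_WANT, I_DO_NOT_WANT))

def product_name(text, default=None):
    """Return the text before the first stop-pattern, or default if none is found."""
    m = RE.search(text)
    return m.group(1) if m else default

print(product_name("Bollos BK 4in 36p 1635g SL 131"))
print(product_name("Pan Dulce"))  # hypothetical input without a stop-pattern
```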

Supposing that:

  • All the words you want to match in your capture group start with an uppercase letter
  • The rest of each word contains only lowercase letters
  • Words are separated by a single space

...you can use one of the following regular expressions:

  1. Using Unicode character properties:

     ^((\p{Lu}\p{Ll}+ )+) 

    > Try this regex on regex101.

  2. Without Unicode support:

     ^(([A-Z][a-z]+ )+) 

    > Try this regex on regex101.
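
Python's built-in re module does not support \p{Lu} or \p{Ll} (the third-party regex package does), so the sketch below uses the ASCII variant. Each repetition captures its trailing space, so the result is stripped on the right. The helper name leading_title_words is hypothetical, and the inner group is written as non-capturing for brevity.

```python
import re

# One or more capitalized words, each followed by a single space.
title_words = re.compile(r'^((?:[A-Z][a-z]+ )+)')

def leading_title_words(text):
    """Return the leading run of capitalized words, or None if there is none."""
    m = title_words.match(text)
    return m.group(1).rstrip() if m else None

data = [
    "Tortillas Bolsa 2a 1kg 4118",
    "Bollos BK 4in 36p 1635g SL 131",
    "Super Pan Bco Ajonjoli 680g SP WON 100",
]
print([leading_title_words(s) for s in data])
```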

I suggest splitting on the first two-letter uppercase word or digit and grabbing the first item:

r = re.compile(r'\b[A-Z]{2}\b|\d')
[r.split(item)[0].strip() for item in my_list]
# => ['Tortillas Bolsa', 'Tortillinas', 'Bollos', 'Super Pan Bco Ajonjoli', 'Pan Blanco Bimbo Rendidor', 'Gansito']

See the Python demo

Pattern details

  • \b[A-Z]{2}\b - a whole word (since \b is a word boundary) made of two uppercase ASCII letters
  • | - or
  • \d - a digit.

With .strip(), all leading and trailing whitespace is trimmed.

A slight variation using re.sub:

re.sub(r'\s*(?:\b[A-Z]{2}\b|\d).*', '', s)

See the regex demo

Details

  • \s* - 0+ whitespace chars
  • (?:\b[A-Z]{2}\b|\d) - either a two-letter uppercase word or a digit
  • .* - the rest of the line.
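
As an illustration, here is the re.sub variant applied to the whole list from the question. The names data, tail and cleaned are assumptions made for this sketch.

```python
import re

data = [
    "Tortillas Bolsa 2a 1kg 4118",
    "Tortillinas 50p 1 31Kg TAB TR 46113",
    "Bollos BK 4in 36p 1635g SL 131",
    "Super Pan Bco Ajonjoli 680g SP WON 100",
    "Pan Blanco Bimbo Rendidor 567g BIM 49973",
    "Gansito ME 5p 250g MTA MLA 49860",
]

# Remove everything from the first two-letter uppercase word or digit onward,
# together with any whitespace immediately before it.
tail = re.compile(r'\s*(?:\b[A-Z]{2}\b|\d).*')
cleaned = [tail.sub('', s) for s in data]
print(cleaned)
```

Unlike the match-based approaches, sub returns the input unchanged when nothing matches, so no None check is needed.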

My 2 cents

^.*?(?=\s[\d]|\s[A-Z]{2,})

https://regex101.com/r/7xD7DS/1/
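
A quick way to try this pattern in Python, assuming the same sample strings as in the question (the names stop_before and trimmed are illustrative):

```python
import re

data = [
    "Tortillas Bolsa 2a 1kg 4118",
    "Tortillinas 50p 1 31Kg TAB TR 46113",
    "Bollos BK 4in 36p 1635g SL 131",
    "Super Pan Bco Ajonjoli 680g SP WON 100",
    "Pan Blanco Bimbo Rendidor 567g BIM 49973",
    "Gansito ME 5p 250g MTA MLA 49860",
]

# Lazily match from the start until a space followed by a digit
# or by two or more uppercase letters.
stop_before = re.compile(r'^.*?(?=\s[\d]|\s[A-Z]{2,})')
trimmed = [m.group() if (m := stop_before.match(s)) else None for s in data]
print(trimmed)
```

The assignment expression (:=) needs Python 3.8 or later; on older versions an explicit loop does the same job.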
