regex two group matches everything until pattern
I have the following examples:
Tortillas Bolsa 2a 1kg 4118
Tortillinas 50p 1 31Kg TAB TR 46113
Bollos BK 4in 36p 1635g SL 131
Super Pan Bco Ajonjoli 680g SP WON 100
Pan Blanco Bimbo Rendidor 567g BIM 49973
Gansito ME 5p 250g MTA MLA 49860
I want to keep everything before the first number, but I also don't want the two-uppercase-letter words, for example ME and BK.
I'm using ^((\D*).*?) [^A-Z]{2,3}
The expected result should be:
Tortillas Bolsa
Tortillinas
Bollos
Super Pan Bco Ajonjoli
Pan Blanco Bimbo Rendidor
Gansito
With the regex I'm using, I'm still getting the two-capital-letter words: Bollos BK and Gansito ME.
Pre-compile a regex pattern with a lookahead (explained below) and call its match method inside a list comprehension:
>>> import re
>>> p = re.compile(r'\D+?(?=\s*([A-Z]{2})?\s*\d)')
>>> [p.match(x).group() for x in data]
[
'Tortillas Bolsa',
'Tortillinas',
'Bollos',
'Super Pan Bco Ajonjoli',
'Pan Blanco Bimbo Rendidor',
'Gansito'
]
Here, data is your list of strings.
Details
\D+? # anything that isn't a digit (non-greedy)
(?= # regex-lookahead
\s* # zero or more wsp chars
([A-Z]{2})? # two optional uppercase letters
\s*
\d # digit
)
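The breakdown above can also be kept inside the pattern itself. The following is a minimal sketch, assuming the six sample strings from the question; it uses re.VERBOSE so that whitespace and # comments inside the pattern are ignored. The variable names (pattern, data, result) are mine, not from the original answer.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments live in the regex.
pattern = re.compile(r"""
    \D+?              # one or more non-digit chars, as few as possible
    (?=               # lookahead: must follow the match, but is not consumed
        \s*           # zero or more whitespace chars
        ([A-Z]{2})?   # an optional pair of uppercase letters
        \s*
        \d            # a digit
    )
""", re.VERBOSE)

# Sample strings copied from the question.
data = [
    'Tortillas Bolsa 2a 1kg 4118',
    'Tortillinas 50p 1 31Kg TAB TR 46113',
    'Bollos BK 4in 36p 1635g SL 131',
    'Super Pan Bco Ajonjoli 680g SP WON 100',
    'Pan Blanco Bimbo Rendidor 567g BIM 49973',
    'Gansito ME 5p 250g MTA MLA 49860',
]

result = [pattern.match(s).group() for s in data]
print(result)
```

Because the lookahead does not consume anything, the optional uppercase pair and the digit only decide where the lazy \D+? stops; they never end up in the match.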
If any string does not contain the pattern you're looking for, the list comprehension will fail with an AttributeError, since p.match(x) returns None in that case. You can instead use a loop and test the result of p.match(x) before extracting the matched portion:
matches = []
for x in data:
    m = p.match(x)
    if m:
        matches.append(m.group())
Or, if you want a placeholder None when there's no match:
matches = []
for x in data:
    m = p.match(x)  # may be None when the pattern is absent
    matches.append(m.group() if m else None)
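If you need this in more than one place, the None check can be wrapped in a small function. This is only a sketch: extract_prefix and its default parameter are hypothetical names I made up, and the second sample string is invented to show the no-match case.

```python
import re

PATTERN = re.compile(r'\D+?(?=\s*([A-Z]{2})?\s*\d)')

def extract_prefix(text, default=None):
    """Return the matched prefix, or `default` when the pattern does not match."""
    m = PATTERN.match(text)
    return m.group() if m else default

# The second string contains no digit, so the lookahead can never succeed.
samples = ['Bollos BK 4in 36p 1635g SL 131', 'no digits here']
prefixes = [extract_prefix(s) for s in samples]
print(prefixes)  # ['Bollos', None]
```

Passing default='' instead of None keeps the result list homogeneous if the next step expects strings.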
You may use the lookahead feature:
import re
I_WANT = r'(.+?)' # This is what you want
I_DO_NOT_WANT = r'\s(?:[0-9]|(?:[A-Z]{2,3}\s))' # Stop-patterns (raw string, so \s is not treated as a string escape)
RE = '{}(?={})'.format(I_WANT, I_DO_NOT_WANT) # Combine the parts
[re.findall(RE, x)[0] for x in test_strings]
#['Tortillas Bolsa', 'Tortillinas', 'Bollos', 'Super Pan Bco Ajonjoli',
# 'Pan Blanco Bimbo Rendidor', 'Gansito']
Supposing that:
...you can use the following regular expressions:
Using Unicode character properties:
^((\p{Lu}\p{Ll}+ )+)
Without Unicode support:
^(([A-Z][a-z]+ )+)
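In Python, the built-in re module does not understand \p{Lu} or \p{Ll} (the third-party regex package does, as far as I know), so the sketch below tries only the ASCII variant. The function name leading_capitalized_words is hypothetical. Note that group 1 ends with the space that follows the last word, so it is stripped here.

```python
import re

# ASCII-only variant: one or more "Capitalized " words at the start of the string.
ascii_words = re.compile(r'^(([A-Z][a-z]+ )+)')

def leading_capitalized_words(text):
    """Return the leading run of Capitalized words, or None if there is none."""
    m = ascii_words.match(text)
    # group(1) includes the trailing space after the last word, so strip it.
    return m.group(1).rstrip() if m else None

print(leading_capitalized_words('Bollos BK 4in 36p 1635g SL 131'))         # Bollos
print(leading_capitalized_words('Super Pan Bco Ajonjoli 680g SP WON 100'))  # Super Pan Bco Ajonjoli
```

This works on the sample data because every word to keep is an uppercase letter followed by lowercase letters; a name containing an accented letter or a lowercase-initial word would cut the match short.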
I suggest splitting on the first two-uppercase-letter word or digit and grabbing the first item:
r = re.compile(r'\b[A-Z]{2}\b|\d')
[r.split(item)[0].strip() for item in my_list]
# => ['Tortillas Bolsa', 'Tortillinas', 'Bollos', 'Super Pan Bco Ajonjoli', 'Pan Blanco Bimbo Rendidor', 'Gansito']
Pattern details
\b[A-Z]{2}\b - a whole two-uppercase-ASCII-letter word (whole because \b are word boundaries)
| - or
\d - a digit.
With .strip(), all trailing and leading whitespace will get trimmed.
A slight variation using re.sub:
re.sub(r'\s*(?:\b[A-Z]{2}\b|\d).*', '', s)
Details
\s* - 0+ whitespace chars
(?:\b[A-Z]{2}\b|\d) - either a two-uppercase-letter word or a digit
.* - the rest of the line.
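Applied to the whole list, the re.sub variation might look like the sketch below. The names cut_tail, data and cleaned are mine, and the sample strings are the ones from the question.

```python
import re

# Remove everything from the first two-uppercase-letter word or digit onward,
# including any whitespace directly in front of it.
cut_tail = re.compile(r'\s*(?:\b[A-Z]{2}\b|\d).*')

data = [
    'Tortillas Bolsa 2a 1kg 4118',
    'Tortillinas 50p 1 31Kg TAB TR 46113',
    'Bollos BK 4in 36p 1635g SL 131',
    'Super Pan Bco Ajonjoli 680g SP WON 100',
    'Pan Blanco Bimbo Rendidor 567g BIM 49973',
    'Gansito ME 5p 250g MTA MLA 49860',
]

cleaned = [cut_tail.sub('', s) for s in data]
print(cleaned)
```

One difference from the match-based approaches: a string with no digit and no two-uppercase-letter word is returned unchanged rather than producing None.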
My 2 cents
^.*?(?=\s[\d]|\s[A-Z]{2,})
https://regex101.com/r/7xD7DS/1/
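For completeness, here is a rough sketch of how that pattern could be used from Python on the sample strings from the question. The names stop_before, data and names are assumptions of mine.

```python
import re

# Lazily match from the start until a space followed by a digit,
# or a space followed by two or more uppercase letters.
stop_before = re.compile(r'^.*?(?=\s[\d]|\s[A-Z]{2,})')

data = [
    'Tortillas Bolsa 2a 1kg 4118',
    'Tortillinas 50p 1 31Kg TAB TR 46113',
    'Bollos BK 4in 36p 1635g SL 131',
    'Super Pan Bco Ajonjoli 680g SP WON 100',
    'Pan Blanco Bimbo Rendidor 567g BIM 49973',
    'Gansito ME 5p 250g MTA MLA 49860',
]

names = []
for s in data:
    m = stop_before.match(s)
    # Keep a None placeholder when the lookahead never succeeds.
    names.append(m.group() if m else None)
print(names)
```

Since the lookahead uses {2,} rather than {2}, it would also stop in front of longer uppercase codes such as BIM or WON if one of them came first.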