
Python regex to return empty string

I want to extract the part of each string in a list that precedes a space followed by a number, in Python. Strings without such a suffix should be kept whole.

# INPUT
text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
# EXPECTED OUTPUT
output = ['bits', 'scrap', 'bits and pieces', 'junk']

I managed to do this using re.sub or re.split:

import re
output = [re.sub(r" [0-9].*", "", t) for t in text]
# OR
output = [re.split(r' \d', t)[0] for t in text]

When I tried to use re.search and re.findall, I got None or an empty list for the strings that contain no number.

[re.search(r'(.*) \d', t) for t in text]
#[None, <_sre.SRE_Match object; span=(0, 7), match='scrap 1'>, None, <_sre.SRE_Match object; span=(0, 6), match='junk 3'>]

[re.findall(r'(.*?) \d', t) for t in text]
#[[], ['scrap'], [], ['junk']]

Can anyone help me with a regex that returns the expected output with re.search and re.findall?

You can remove the digit-and-dot substrings, only when they occur at the end of the string, with

import re
text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
print([re.sub(r'\s+\d+(?:\.\d+)*$', '', x) for x in text])
# => output = ['bits', 'scrap', 'bits and pieces', 'junk']

See the Python demo.

Pattern details:

  • \s+ - 1+ whitespaces (note: if those digits can be "glued" to some other text, replace the + quantifier (one or more occurrences) with * (zero or more occurrences))
  • \d+ - 1 or more digits
  • (?:\.\d+)* - 0 or more sequences of
    • \. - a dot
    • \d+ - 1 or more digits
  • $ - end of string.
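
As a rough illustration of the breakdown above, the same pattern can be written with re.VERBOSE so that each part carries its own comment. The names TRAILING_VERSION, cleaned and glued are made up for this sketch and do not come from the original answer; the last line shows the \s* variant mentioned in the note, under the assumption that the number may follow the text without a space.

```python
import re

# Same pattern as in the answer, spelled out in verbose mode.
# TRAILING_VERSION is a hypothetical name chosen for this sketch.
TRAILING_VERSION = re.compile(r"""
    \s+            # one or more whitespace characters
    \d+            # one or more digits
    (?:\.\d+)*     # zero or more groups of: a dot, then one or more digits
    $              # end of string
""", re.VERBOSE)

text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
cleaned = [TRAILING_VERSION.sub('', s) for s in text]
print(cleaned)

# Variant with \s* for the case where the number is "glued" to the text.
glued = re.sub(r'\s*\d+(?:\.\d+)*$', '', 'scrap1.2')
print(glued)
```

Because the pattern is anchored with $, a number in the middle of a string (as in 'abc 5.6 def 6.8.9') is left alone and only the trailing one is removed.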

See the regex demo.

To do the same with re.findall, you can use

# To get 'abc 5.6 def' (not 'abc') from 'abc 5.6 def 6.8.9'
re.findall(r'^(.*?)(?: \d[\d.]*)?$', x)
# To get 'abc' (not 'abc 5.6 def') from 'abc 5.6 def 6.8.9'
re.findall(r'^(.*?)(?: \d.*)?$', x)

See this regex demo.
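
A minimal sketch of how the first re.findall pattern might be applied to the list from the question. With a single capturing group, re.findall returns a list of that group's matches, so each result is unwrapped with [0]; the same pattern also works with re.search, since it matches every string. The names PREFIX, via_findall and via_search are illustrative only.

```python
import re

text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']

# PREFIX is a hypothetical name for the first findall pattern in the answer.
PREFIX = re.compile(r'^(.*?)(?: \d[\d.]*)?$')

# findall yields a one-element list per string (the content of group 1).
via_findall = [PREFIX.findall(s)[0] for s in text]

# The optional trailing group means search never returns None here,
# so calling .group(1) directly is safe for these inputs.
via_search = [PREFIX.search(s).group(1) for s in text]

print(via_findall)
print(via_search)
```

The original attempts returned None or [] because ' \d' was required; making that part optional and anchoring with ^ and $ lets strings without a number match as a whole.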

However, the re.findall regex above is not very efficient because of the lazy .*? construct. In that pattern:

  • ^ - start of string
  • (.*?) - Group 1: any zero or more chars other than line break chars (use re.DOTALL to match all), as few as possible (so that the next optional group can be tested at each position)
  • (?: \d[\d.]*)? - an optional non-capturing group matching
    • ' ' - a space
    • \d - a digit
    • [\d.]* - zero or more digits or . chars
    • (OR, in the second pattern) .* - any 0+ chars other than line break chars, as many as possible
  • $ - end of string.
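
If the lazy .*? is a concern, one possible alternative (not part of the original answer) is to search only for the trailing number with re.search and slice the string at the start of the match. The helper strip_trailing_number and the name TRAILING_NUMBER are hypothetical, introduced just for this sketch.

```python
import re

# Reuses the end-anchored pattern from the re.sub solution.
TRAILING_NUMBER = re.compile(r'\s+\d+(?:\.\d+)*$')

def strip_trailing_number(s):
    """Return s without a trailing ' 1.2'-style number, if there is one."""
    m = TRAILING_NUMBER.search(s)
    # search returns None when there is no trailing number: keep s unchanged.
    return s[:m.start()] if m else s

text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
via_slice = [strip_trailing_number(s) for s in text]
print(via_slice)
```

This handles the None case explicitly, which is what the re.search attempt in the question was missing.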

