
Regex pattern to find n non-space characters of x length after a certain substring

I am using the regex pattern pattern = r'cig[\s:.]*(\w{10})' to extract the 10 characters after the substring 'cig' in each row of my dataframe. This pattern covers all cases except those where the 10-character substring itself contains spaces.

For example, I am trying to extract Z9F27D2198 from the string

/BENEF/FORNITURA GAS FEB-20 CIG Z9F                 27D2198 01762-0000031

Stack Overflow seems to have reformatted the string above; there should be 17 whitespace characters between F and 2, after CIG.

Could you help me edit the regex pattern to account for whitespace inside that 10-character substring? I am also passing flags=re.I to my re.findall calls to ignore case.

Here is an example string for which the pattern works:

CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E

and it outputs what I want: 7826328A2B.
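To make the failure concrete, here is a minimal sketch that runs the original pattern on both sample strings. The variable names are illustrative, and the 17-space gap is built explicitly so that formatting cannot collapse it.

```python
import re

# Original pattern from the question: 10 consecutive word characters after "cig".
pattern = r'cig[\s:.]*(\w{10})'

works = "CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E"
# Build the gap explicitly: 17 spaces between "Z9F" and "27D2198".
fails = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 17 + "27D2198 01762-0000031"

found_ok = re.findall(pattern, works, flags=re.I)
found_gap = re.findall(pattern, fails, flags=re.I)
print(found_ok)   # ['7826328A2B']
print(found_gap)  # []
```

The second call returns nothing because \w{10} cannot step over the whitespace inside the code.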

Thanks in advance.

You can use

r'(?i)cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'

See the regex demo. Details:

  • cig - the literal string cig
  • [\s:.]* - zero or more whitespace characters, colons (:) or dots (.)
  • (\S(?:\s*\S){9}) - Group 1: a non-whitespace char, then nine occurrences of zero or more whitespaces followed by a non-whitespace char
  • (?!\S) - immediately to the right, there must be a whitespace or the end of the string.
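The breakdown above can also be written into the pattern itself using Python's verbose mode. This is only a sketch of the same regex with inline comments; the name CIG_PATTERN and the sample input are made up for illustration.

```python
import re

# Same pattern as above, spelled out with re.X (verbose) so each part is annotated.
CIG_PATTERN = re.compile(
    r"""
    cig              # the literal marker (case-insensitive via re.I below)
    [\s:.]*          # zero or more whitespace characters, colons or dots
    (                # group 1: the code, possibly interrupted by whitespace
        \S           #   first non-whitespace character
        (?:\s*\S){9} #   nine more, each optionally preceded by whitespace
    )
    (?!\S)           # the code must be followed by whitespace or end of string
    """,
    re.I | re.X,
)

m = CIG_PATTERN.search("CIG: Z9F   27D2198 rest")
code = m.group(1) if m else None
print(repr(code))  # 'Z9F   27D2198'
```

Because of the final lookahead, a run of eleven non-whitespace characters such as "CIG 7826328A2BX" does not match at all, rather than being cut off after ten.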

In Python, you can use

import re
text = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F               27D2198 01762-0000031"
pattern = r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
matches = re.finditer(pattern, text, re.I)
for match in matches:
  print(re.sub(r'\s+', '', match.group(1)), ' found at ', match.span(1))

# => Z9F27D2198  found at  (32, 57)

See the Python demo.
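Since the question is about many lines rather than one string, the same pattern could be wrapped in a small helper and applied line by line. This is a sketch under assumptions: extract_cig and the sample list are hypothetical names, and the first match per line is taken.

```python
import re

# Pattern from the answer, compiled once with case-insensitive matching.
CIG_RE = re.compile(r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)', re.I)

def extract_cig(line):
    """Return the 10-character code after 'cig' with whitespace removed, or None."""
    match = CIG_RE.search(line)
    if match is None:
        return None
    # Drop the whitespace that was allowed inside the captured group.
    return re.sub(r'\s+', '', match.group(1))

lines = [
    "CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E",
    "/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 17 + "27D2198 01762-0000031",
    "no code in this line",
]
codes = [extract_cig(line) for line in lines]
print(codes)  # ['7826328A2B', 'Z9F27D2198', None]
```

With a pandas column, the same helper could presumably be passed to Series.apply.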

What about:

# remove all spaces with replace(), then take the 10 characters after "CIG"

x = 'CIG7826328A2B FORNITURA ENERGIA ELETTRICA U'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = '7826328A2B'

x = '/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = 'Z9F27D2198'

This works fine if there is only one "CIG" in the string.
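The replace()/split() approach is case-sensitive and only removes plain spaces. One possible variation, which is not part of the answer itself, strips every kind of whitespace with re.sub and then searches case-insensitively. The function name cig_after_stripping and the sample input are hypothetical.

```python
import re

def cig_after_stripping(text):
    """Remove all whitespace, then take the 10 word characters after 'cig'."""
    compact = re.sub(r'\s+', '', text)
    # After stripping, only ':' and '.' can still sit between the marker and the code.
    match = re.search(r'cig[:.]*(\w{10})', compact, flags=re.I)
    return match.group(1) if match else None

result = cig_after_stripping("/BENEF/FORNITURA GAS FEB-20 cig: Z9F \t 27D2198 01762-0000031")
print(result)  # Z9F27D2198
```

The trade-off is that stripping whitespace erases word boundaries: adjacent words can be glued into an accidental "cig", and there is no longer a way to check that the code is exactly 10 characters long.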
