Regex pattern to find n non-space characters of x length after a certain substring
I am using this regex pattern
pattern = r'cig[\s:.]*(\w{10})'
to extract the 10 characters after 'cig' in each row of my dataframe. This pattern covers all of my cases except those where the 10-character substring itself contains whitespace.
For example, I am trying to extract Z9F27D2198 from the string
/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031
Stack Overflow seems to have collapsed the whitespace in the string above: there should be 17 whitespace characters between F and 2, after CIG.
Could you help me edit the regex pattern so that it accounts for whitespace inside that 10-character substring? I am also passing flags=re.I to my re.findall calls to ignore case.
Here is an example string for which the current pattern works:
CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E
It outputs what I want: 7826328A2B.
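The behaviour described above can be reproduced with a short script. This is only a sketch: the variable names are made up for illustration, and the 17-character whitespace run is rebuilt from the description rather than copied from the real data.

```python
import re

# Pattern from the question: 'cig', optional separators, then 10 word characters
pattern = r'cig[\s:.]*(\w{10})'

# Assumption: the code is split by 17 spaces, as described in the question
split_code = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 17 + "27D2198 01762-0000031"
contiguous_code = "CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E"

found_split = re.findall(pattern, split_code, flags=re.I)
found_contiguous = re.findall(pattern, contiguous_code, flags=re.I)

print(found_split)       # [] because \w{10} cannot cross the whitespace inside the code
print(found_contiguous)  # ['7826328A2B']
```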
Thanks in advance.
You can use
r'(?i)cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
See the regex demo. Details:
cig - the literal string cig
[\s:.]* - zero or more whitespace characters, : or .
(\S(?:\s*\S){9}) - Group 1: a non-whitespace char, then nine occurrences of zero or more whitespace chars followed by a non-whitespace char
(?!\S) - immediately to the right, there must be a whitespace char or the end of the string.
In Python, you can use
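import re

text = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031"
pattern = r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
# re.I lets the lowercase 'cig' in the pattern match 'CIG' in the text
matches = re.finditer(pattern, text, re.I)
for match in matches:
    # Group 1 may contain inner whitespace, so remove it before printing
    print(re.sub(r'\s+', '', match.group(1)), 'found at', match.span(1))
# => Z9F27D2198 found at (32, 43)
# The end offset grows with the number of whitespace characters inside the code.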
See the Python demo.
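To apply this to many rows, the same pattern could be wrapped in a small helper. The sketch below is one possible arrangement, not part of the original answer: the names CIG_RE and extract_cig are hypothetical, the pattern is written with re.X only so each piece can carry a comment, and the 17-space run is rebuilt from the question's description.

```python
import re

# Same pattern as above, spelled out in verbose mode (re.X) for readability
CIG_RE = re.compile(
    r"""
    cig              # the literal marker, matched case-insensitively
    [\s:.]*          # optional separators: whitespace, colon or dot
    (                # group 1: the code, possibly split by whitespace
      \S             #   first non-whitespace character
      (?:\s*\S){9}   #   nine more, each optionally preceded by whitespace
    )
    (?!\S)           # the code must be followed by whitespace or end of string
    """,
    re.I | re.X,
)

def extract_cig(line):
    """Return the 10-character code after 'cig' without inner whitespace, or None."""
    m = CIG_RE.search(line)
    return re.sub(r"\s+", "", m.group(1)) if m else None

lines = [
    "/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 17 + "27D2198 01762-0000031",
    "CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E",
    "no code here",
]
codes = [extract_cig(line) for line in lines]
print(codes)  # ['Z9F27D2198', '7826328A2B', None]
```

With a dataframe, a helper like this could presumably be passed to the column's apply or map method; that part is left out here so the sketch stays runnable with the standard library alone.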
What about:
# remove all spaces with replace(), then take the 10 characters after "CIG"
x = 'CIG7826328A2B FORNITURA ENERGIA ELETTRICA U'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = '7826328A2B'
x = '/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = 'Z9F27D2198'
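This works fine if there is only one "CIG" in the string. Note that split("CIG") is case-sensitive and that replace(' ', '') removes only space characters, not tabs or newlines.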
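A slightly more defensive variant of the same regex-free idea might look like the sketch below. The function name cig_without_regex and its parameters are hypothetical; it assumes ASCII text, uses the first occurrence of the marker, and skips a leading ':' or '.' the way the question's pattern does.

```python
def cig_without_regex(line, marker="CIG", length=10):
    """Return `length` characters after the first `marker`, ignoring whitespace and case."""
    compact = "".join(line.split())        # drop every whitespace character, not only spaces
    start = compact.upper().find(marker)   # case-insensitive search (assumes ASCII text)
    if start == -1:
        return None
    tail = compact[start + len(marker):].lstrip(":.")  # skip ':' or '.' separators
    return tail[:length] if len(tail) >= length else None

first = cig_without_regex("/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 17 + "27D2198 01762-0000031")
second = cig_without_regex("cig: 7826328A2B FORNITURA ENERGIA")
missing = cig_without_regex("no code here")
print(first, second, missing)  # Z9F27D2198 7826328A2B None
```

One caveat of removing whitespace before searching: words that were separate can merge into a false marker (for instance "MEDICI GAS" becomes "MEDICIGAS", which contains "CIG"), so the regex approach is likely the safer choice for free-form text.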