Python regex find matched string
I am trying to find the matched string in a string using regex in Python. The string looks like this:
band 1 # energy -53.15719532 # occ. 2.00000000
ion s p d tot
1 0.000 0.995 0.000 0.995
2 0.000 0.000 0.000 0.000
tot 0.000 0.996 0.000 0.996
band 2 # energy -53.15719532 # occ. 2.00000000
ion s p d tot
1 0.000 0.995 0.000 0.995
2 0.000 0.000 0.000 0.000
tot 0.000 0.996 0.000 0.996
band 3 # energy -53.15719532 # occ. 2.00000000
My goal is to find the string after tot. So the matched string will be something like:
['0.000 0.996 0.000 0.996',
'0.000 0.996 0.000 0.996']
Here is my current code:
pattern = re.compile(r'tot\s+(.*?)\n', re.DOTALL)
pattern.findall(string)
However, the output gives me:
['1 0.000 0.995 0.000 0.995',
'0.000 0.996 0.000 0.996',
'1 0.000 0.995 0.000 0.995',
'0.000 0.996 0.000 0.996']
Any idea of what I am doing wrong?
You don't want the DOTALL flag. Remove it and use MULTILINE instead.
pattern = re.compile(r'^\s*tot(.*)', re.MULTILINE)
This matches all lines that start with tot. The rest of the line will be in group 1.
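As a quick check, here is a minimal sketch that runs this pattern over a shortened copy of the sample data from the question. The names `sample` and `matches` are only illustrative, and the `.strip()` call is an addition, there to drop the whitespace that group 1 keeps between `tot` and the first number.

```python
import re

# Shortened copy of the data from the question (illustrative only).
sample = (
    "band 1 # energy -53.15719532 # occ. 2.00000000\n"
    "ion s p d tot\n"
    "1 0.000 0.995 0.000 0.995\n"
    "2 0.000 0.000 0.000 0.000\n"
    "tot 0.000 0.996 0.000 0.996\n"
    "band 2 # energy -53.15719532 # occ. 2.00000000\n"
    "ion s p d tot\n"
    "1 0.000 0.995 0.000 0.995\n"
    "2 0.000 0.000 0.000 0.000\n"
    "tot 0.000 0.996 0.000 0.996\n"
    "band 3 # energy -53.15719532 # occ. 2.00000000\n"
)

# With MULTILINE, ^ matches at the start of every line, not just the string.
pattern = re.compile(r'^\s*tot(.*)', re.MULTILINE)

# findall returns group 1 for each match; strip the leading whitespace.
matches = [rest.strip() for rest in pattern.findall(sample)]
print(matches)
```

The header lines ("ion s p d tot") are skipped because tot is not at the start of those lines.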
Citing the documentation, emphasis mine:
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
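The difference the flag makes can be seen on a tiny two-line string. This is only a sketch; the variable names are made up for the illustration.

```python
import re

text = "a\nb"

# Without DOTALL the dot stops at the newline, so each line matches separately.
without_flag = re.findall(r'.+', text)

# With DOTALL the dot also consumes the newline, giving one long match.
with_flag = re.findall(r'.+', text, re.DOTALL)

print(without_flag)  # ['a', 'b']
print(with_flag)     # ['a\nb']
```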
Note that you can easily do this without regex.
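with open("input.txt", "r") as data_file:
    for line in data_file:
        # split() with no argument splits on any whitespace and drops empty strings
        items = line.split()
        if items and items[0] == "tot":
            print(" ".join(items[1:]))  # the values after "tot"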
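If the data is already in a string rather than a file, the same line-by-line idea can be wrapped in a small function. The sketch below is one possible way to do it; `tot_rows` is a hypothetical helper name, and converting the values to float is an assumption about what you want to do with them next.

```python
def tot_rows(text):
    """Return the numbers on every line whose first field is 'tot'."""
    rows = []
    for line in text.splitlines():
        items = line.split()  # split on any run of whitespace
        if items and items[0] == "tot":
            rows.append([float(value) for value in items[1:]])
    return rows


sample = (
    "ion s p d tot\n"
    "1 0.000 0.995 0.000 0.995\n"
    "2 0.000 0.000 0.000 0.000\n"
    "tot 0.000 0.996 0.000 0.996\n"
)
rows = tot_rows(sample)
print(rows)  # [[0.0, 0.996, 0.0, 0.996]]
```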
You are using re.DOTALL, which means that the dot "." will match anything, even newlines, in essence matching each "tot" and everything that follows until the next newline:
tot
1 0.000 0.995 0.000 0.995
and
tot 0.000 0.996 0.000 0.996
Removing re.DOTALL should fix your problem.
Edit: Actually, the DOTALL flag is not really the issue (though unnecessary). The problem in the pattern is that the \s+ matches the newline. Replacing that with a single space solves that issue:
pattern = re.compile(r'tot (.*?)\n')
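To see that `\s+` is the culprit, the sketch below runs the original pattern (without DOTALL) and the single-space version side by side on a shortened sample. The third pattern, using `[ \t]+`, is not from the answer above; it is a suggested variation in case the columns are separated by several spaces or tabs.

```python
import re

sample = (
    "ion s p d tot\n"
    "1 0.000 0.995 0.000 0.995\n"
    "2 0.000 0.000 0.000 0.000\n"
    "tot 0.000 0.996 0.000 0.996\n"
    "band 2 # energy -53.15719532 # occ. 2.00000000\n"
)

# \s+ swallows the newline after the header's "tot", so the next line is captured.
with_s = re.findall(r'tot\s+(.*?)\n', sample)

# A literal space cannot match a newline, so the header line no longer matches.
with_space = re.findall(r'tot (.*?)\n', sample)

# Variation (an assumption, not from the answer): spaces or tabs, but no newlines.
with_blanks = re.findall(r'tot[ \t]+(.*?)\n', sample)

print(with_s)       # ['1 0.000 0.995 0.000 0.995', '0.000 0.996 0.000 0.996']
print(with_space)   # ['0.000 0.996 0.000 0.996']
print(with_blanks)  # ['0.000 0.996 0.000 0.996']
```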
An alternative solution using the re.findall function with a specific regex pattern:
# string is your initial string
result = re.findall(r'tot [0-9 .]+(?=\n|$)', string)
print(result)
The output:
['tot 0.000 0.996 0.000 0.996', 'tot 0.000 0.996 0.000 0.996']
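The matches above still carry the leading "tot ". If only the numbers are wanted, as in the question, one option is to put them in a capture group so that findall returns just that part. This is a sketch of that variation, not part of the original answer; `sample` and `numbers` are illustrative names.

```python
import re

sample = (
    "ion s p d tot\n"
    "1 0.000 0.995 0.000 0.995\n"
    "tot 0.000 0.996 0.000 0.996\n"
    "band 2 # energy -53.15719532 # occ. 2.00000000\n"
    "tot 0.000 0.996 0.000 0.996"  # last line has no trailing newline
)

# The group captures only digits, spaces and dots; the lookahead requires a
# newline or the end of the string right after them without consuming it.
numbers = re.findall(r'tot ([0-9 .]+)(?=\n|$)', sample)
print(numbers)
```

Because of the `$` alternative in the lookahead, a final tot line without a trailing newline is still found.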