Python get all regexp matching groups from a list

Suppose I have read all the lines of a text file as follows:

ifile = open('myfile.txt')
lines = ifile.readlines()

Now, suppose I have the following regular expression:

import re
rgx = re.compile(r'Found ([0-9]+) solutions')

I can use

result = list(filter(rgx.match, lines))
print(result)

to obtain a list of the matching lines, but what I want is a list of the matching groups. For example, instead of output like:

Found 3 solutions
Found 35 solutions
Found 0 solutions

I want output like:

3
35
0

How can I do this?

import re

rgx = re.compile(r'Found ([0-9]+) solutions')

with open('myfile.txt') as f:
    result = [m.group(1) for m in (rgx.match(line) for line in f) if m]

The inner loop, (rgx.match(line) for line in f), is a generator expression that acts like map(). For each line in the file, it calls rgx.match() and yields the result: an SRE_Match object (I usually just call it a "match object"), or None when the line doesn't match.

The outer loop has if m, which discards any result that does not evaluate true (re.match() returns None when the pattern doesn't match). Then m.group(1) uses the match object to get the text from inside the parentheses. See the documentation for the re module for details. Since the outer loop is part of a list comprehension, a list of results is built and returned.
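To see the two loops separately, here is a minimal sketch that unrolls the comprehension into explicit steps. It assumes the lines are already in memory; the sample_lines list is made up for illustration and stands in for the contents of myfile.txt.

```python
import re

rgx = re.compile(r'Found ([0-9]+) solutions')

# Hypothetical sample data standing in for the lines of myfile.txt.
sample_lines = [
    "Found 3 solutions\n",
    "No solutions here\n",
    "Found 35 solutions\n",
    "Found 0 solutions\n",
]

# Inner loop: lazily produces one match object (or None) per line.
matches = (rgx.match(line) for line in sample_lines)

# Outer loop: skip the None results and keep the text of group 1.
result = []
for m in matches:
    if m:
        result.append(m.group(1))

print(result)  # ['3', '35', '0']
```

The captured groups are strings; wrap m.group(1) in int() if you want numbers.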

Since the prefix and suffix are fixed strings, you can use look-around:

r'(?<=Found )\d+(?= solutions)'

I think there should be some way to use your regular expression to do the job, though.
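A rough sketch of how the look-around pattern might be used. One thing to watch: it has to go through search() (or findall()), not match(), because the look-behind needs the "Found " text to sit before the position where the match starts. The sample_lines list is invented for the example.

```python
import re

# Look-around version: the match itself is only the digits, so no group is needed.
lookaround = re.compile(r'(?<=Found )\d+(?= solutions)')

# Hypothetical sample data.
sample_lines = [
    "Found 3 solutions\n",
    "No solutions here\n",
    "Found 35 solutions\n",
    "Found 0 solutions\n",
]

# search() scans each line; group(0) is the whole match, i.e. just the number.
counts = [m.group(0) for m in map(lookaround.search, sample_lines) if m]
print(counts)  # ['3', '35', '0']

# match() is anchored at position 0, where the look-behind cannot succeed.
print(lookaround.match("Found 3 solutions"))  # None
```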

You get "match" objects back from the match call (unless you implicitly turn them back into strings using filter). Alas, there isn't decent documentation available via IPython help, but it is online: http://docs.python.org/3/library/re.html#match-objects

E.g.

for line in lines:
    result = rgx.match(line)
    if not result:
        continue
    print(result.group(1))

print('\n'.join([m.group(1) for l in lines for m in [rgx.search(l)] if m]))
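The two snippets above differ in one detail: the loop uses rgx.match, which only matches at the start of the line, while the one-liner uses rgx.search, which scans the whole line. A small sketch of the difference, using a made-up line with a prefix:

```python
import re

rgx = re.compile(r'Found ([0-9]+) solutions')

# Hypothetical input where the text of interest is not at the start of the line.
line = "INFO: Found 12 solutions"

at_start = rgx.match(line)   # anchored at position 0, so this is None
anywhere = rgx.search(line)  # scans forward until the pattern fits

print(at_start)           # None
print(anywhere.group(1))  # 12
```

For the file in the question every interesting line starts with "Found", so both give the same result there.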

So the other solutions offered here are fine and probably the most readable, but for your specific needs I would suggest there are a couple of one-line alternatives (bearing in mind, of course, that your question is from 2013 and you probably don't work at the same company, let alone on the same project). I also think this is of some general interest if anyone finds themselves here. Because your premise is very simple (one interesting piece of data on each line), you can do the following:

>>> import re
>>> # simulate reading the (hopefully not ginormous) file into a single string
>>> file_contents = "Found 3 solutions\nFound 35 solutions\nFound 0 solutions\n"
>>> # we're now in the state we would be after "file_contents = file.read()"
>>> print(file_contents)
Found 3 solutions
Found 35 solutions
Found 0 solutions

>>> # we're so constrained, we can get away with murder in a single line
>>> solution_counts = re.findall(r'\d+', file_contents)
>>> solution_counts
['3', '35', '0']
>>> # bazinga!

This is a surprisingly robust solution. If your file is localized in a way that changes the words "Found" and "solutions" to translated equivalents, this solution doesn't care, as long as the formatting remains the same. Headers and footers that don't contain decimal integers? Doesn't care. It could work on a single string like "Found solution sets of count 3, 35, and 0". The exact same code will extract the answer you want. However, it's more common that you know the format but can't control it, that each line/record is full of heterogeneous data, and that the section you care about is surrounded by others that you may or may not care about. So consider the wacky variant below:

file_contents = "99 bottles of beer on the wall\n" \
                "50 ways to leave your lover\n" \
                "6 kinds of scary\n" \
                "Found 3 solutions of type A\n" \
                "Found 35 solutions of type C\n" \
                "Found 4 solutions of unknown type\n" \
                "2 heads are better than 1\n" \
                "etc, ...\n"

Our naive solution will return ['99', '50', '6', '3', '35', '4', '2', '1'], which is not all that interesting unless you know how to filter out the extraneous data: so confusing, error-prone, and fragile that it earns 1 star out of five. It would be easy, and probably a nice clean solution, to iterate over the lines instead of ingesting the whole stream of bytes into memory, but let's stick with the assumption that we have to for some reason. Maybe it doesn't come from a file (captured from a TCP/IP stream or whatever). Using another one-liner, file_contents.split('\n'), we get the lines separated again (without the newlines), and can iterate and do comprehensions, etc., but we could also skip right to it using finditer:

>>> [m.group(1) for m in re.finditer(r'Found (\d+)', file_contents)]
['3', '35', '4']

Pretty robust. I'm not even sure it's faster to pre-compile unless you're processing lots of nightmare files.
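If the "solutions" suffix and the start-of-line position matter too, one possible variant (a sketch, not the only way) is re.findall with a single capturing group plus the re.MULTILINE flag. The file_contents string repeats the wacky sample from above so the block runs on its own.

```python
import re

file_contents = (
    "99 bottles of beer on the wall\n"
    "50 ways to leave your lover\n"
    "6 kinds of scary\n"
    "Found 3 solutions of type A\n"
    "Found 35 solutions of type C\n"
    "Found 4 solutions of unknown type\n"
    "2 heads are better than 1\n"
    "etc, ...\n"
)

# With exactly one capturing group, findall returns the group text
# rather than the whole match. re.MULTILINE lets ^ match at the
# start of every line, not only at the start of the string.
counts = re.findall(r'^Found (\d+) solutions', file_contents, re.MULTILINE)
print(counts)  # ['3', '35', '4']
```

This trades a little of the finditer version's looseness for a stricter check on the line format.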
