Find all matches between two strings with regex

I am using regex for the first time and am trying to parse some data from an HTML table with it. I want to grab everything between the <tr > and </tr> tags, and then apply a similar regex to the result to build a JSON array.

I tried the pattern below, but it only matches the first group and none of the rest.

<tr >(.*?)</tr>

How do I make it find all matches between those tags?

Although using regex for this job is a bad idea (there are many ways for things to go wrong), your pattern is basically correct.
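As one illustration of how things can go wrong, here is a minimal sketch with a made-up `html` string (not data from the question). The literal pattern only matches rows whose opening tag is exactly `<tr >`. The more tolerant pattern shown next to it is my own assumption, not something from the question, and it is still not a real HTML parser: it would break on an attribute value containing `>`, for example.

```python
import re

# Hypothetical input: three rows whose opening tags are written differently.
html = '<tr>a</tr><tr >b</tr><tr class="odd">c</tr>'

# The literal pattern from the question only matches the exact text "<tr >".
strict = re.findall(r'<tr >(.*?)</tr>', html, re.DOTALL)

# A more tolerant (still imperfect) pattern: "<tr", a word boundary,
# then any run of characters that are not ">" (the attributes), then ">".
tolerant = re.findall(r'<tr\b[^>]*>(.*?)</tr>', html, re.DOTALL | re.IGNORECASE)

print(strict)    # ['b']
print(tolerant)  # ['a', 'b', 'c']
```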

Returning All Matches with Python

The question then becomes how to return all matches or capture groups in Python. There are two basic ways:

  1. finditer
  2. findall
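For contrast, `re.search` stops at the first match, which would produce the symptom described in the question. That is only a guess, since the question does not show the calling code. A minimal sketch, using a made-up `txt` string:

```python
import re

# Hypothetical input with two rows.
txt = '<tr >foo</tr><tr >bar</tr>'
pattern = re.compile(r'<tr >(.*?)</tr>', re.DOTALL)

first = pattern.search(txt)                           # one match object (or None)
every = pattern.findall(txt)                          # list with one entry per match
lazy = [m.group(1) for m in pattern.finditer(txt)]    # same data via match objects

print(first.group(1))  # foo
print(every)           # ['foo', 'bar']
print(lazy)            # ['foo', 'bar']
```

`finditer` yields match objects, so it is the better fit when you also need positions (`m.span()`) or several groups per match.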

With finditer

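import re

regex = re.compile(r'<tr >(.*?)</tr>', re.DOTALL)  # DOTALL lets . match newlines inside a row
for match in regex.finditer(subject):  # subject is the HTML text to search
    print("The Overall Match: ", match.group(0))
    print("Group 1: ", match.group(1))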

With findall

findall is a bit strange. When the pattern has capture groups, findall returns the groups rather than the overall match. To access both the capture groups and the overall match, you have to wrap your original regex in parentheses (so that the overall match is captured too). In your case, if you want to access both the whole element including its tags and the inside (which you captured with Group 1), your regex becomes: (<tr >(.*?)</tr>) . findall then returns one tuple per match. Then you do:

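import re

regex = re.compile(r'(<tr >(.*?)</tr>)', re.DOTALL)
for match in regex.findall(subject):  # each match is a tuple: (overall match, inner group)
    print("The Overall Match: ", match[0])
    print("Group 1: ", match[1])  # the original Group 1, now the second group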
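To make that behavior concrete, here is a small self-contained sketch (the `subject` string is made up) showing how the shape of what findall returns depends on the number of capture groups in the pattern:

```python
import re

# Hypothetical input with two rows.
subject = '<tr >foo</tr><tr >bar</tr>'

no_groups = re.findall(r'<tr >.*?</tr>', subject)       # whole matches
one_group = re.findall(r'<tr >(.*?)</tr>', subject)     # just the group
two_groups = re.findall(r'(<tr >(.*?)</tr>)', subject)  # tuples of groups

print(no_groups)   # ['<tr >foo</tr>', '<tr >bar</tr>']
print(one_group)   # ['foo', 'bar']
print(two_groups)  # [('<tr >foo</tr>', 'foo'), ('<tr >bar</tr>', 'bar')]
```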

It works for me. Perhaps you need to use findall, or perhaps you're not using a raw string?

import re

txt = '''<tr >foo</tr><tr >bar

</tr>

<tr >baz</tr>'''

# Be sure to use the DOTALL flag so the newlines are matched by the dot as well.
re.findall(r'<tr >(.*?)</tr>', txt, re.DOTALL)

returns

['foo', 'bar\n\n', 'baz']
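The question also mentions turning the rows into a JSON array. A rough sketch of that second step follows, assuming the cells are plain `<td>` elements without attributes. The sample `html` and the `<td>` pattern are assumptions of mine, since the question does not show the actual table markup.

```python
import json
import re

# Hypothetical sample table; the cell markup (<td>) is an assumption.
html = '''<tr ><td>Alice</td><td>30</td></tr>
<tr ><td>Bob</td>
<td>25</td></tr>'''

# First pass: one string per row. Second pass: the cells inside each row.
rows = re.findall(r'<tr >(.*?)</tr>', html, re.DOTALL)
table = [re.findall(r'<td>(.*?)</td>', row, re.DOTALL) for row in rows]

print(json.dumps(table))  # [["Alice", "30"], ["Bob", "25"]]
```

Every cell comes back as a string, so numeric columns would need converting before serializing if the JSON consumer expects numbers.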
