
Matching multiple lines in a Python regular expression

I want to extract the data between <tr> tags from an HTML page. I used the following code, but I didn't get any results. The HTML between the <tr> tags spans multiple lines.

category = re.findall('<tr>(.*?)</tr>', data)

Please suggest a fix for this problem.

Just to clear up the issue: despite all those links to re.M, it won't work here, as even a quick skim of its documentation would reveal. re.M only changes where ^ and $ match; it is re.S that lets . match newlines. So you'd need re.S, assuming you weren't trying to parse HTML with it, of course:

>>> doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

>>> re.findall('<tr>(.*?)</tr>', doc, re.S)
['\n        <td>row 1, cell 1</td>\n        <td>row 1, cell 2</td>\n    ', 
 '\n        <td>row 2, cell 1</td>\n        <td>row 2, cell 2</td>\n    ']
>>> re.findall('<tr>(.*?)</tr>', doc, re.M)
[]
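To make the difference between the two flags concrete, here is a small sketch. The sample string `text` is made up for illustration and is not from the question. It also shows the inline `(?s)` flag, which should behave the same as passing `re.S`:

```python
import re

# Made-up sample input, not from the original question.
text = "<tr>\nfirst\n</tr>\n<tr>\nsecond\n</tr>"

# Without flags, "." stops at newlines, so nothing matches.
no_flag = re.findall(r"<tr>(.*?)</tr>", text)

# re.S (re.DOTALL): "." also matches "\n", so each row body is captured.
dotall = re.findall(r"<tr>(.*?)</tr>", text, re.S)

# re.M (re.MULTILINE): "^" and "$" match at every line boundary,
# but "." still does not cross newlines, so the <tr> pattern finds nothing.
multiline = re.findall(r"<tr>(.*?)</tr>", text, re.M)
whole_lines = re.findall(r"^\w+$", text, re.M)

# The inline flag (?s) is an alternative to passing re.S.
inline = re.findall(r"(?s)<tr>(.*?)</tr>", text)

print(no_flag)      # []
print(dotall)       # ['\nfirst\n', '\nsecond\n']
print(multiline)    # []
print(whole_lines)  # ['first', 'second']
print(inline)       # ['\nfirst\n', '\nsecond\n']
```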

Don't use regex; use an HTML parser such as BeautifulSoup:

html = '<html><body>foo<tr>bar</tr>baz<tr>qux</tr></body></html>'

from bs4 import BeautifulSoup  # BeautifulSoup 4, from the beautifulsoup4 package
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all("tr"))

Result:

[<tr>bar</tr>, <tr>qux</tr>]

If you just want the contents, without the tr tags:

for tr in soup.find_all("tr"):
    print(tr.get_text())

Result:

bar
qux

Using an HTML parser isn't as scary as it sounds! And it will work more reliably than any regex posted here.

Do not use regular expressions to parse HTML. Use an HTML parser such as lxml or BeautifulSoup.

import re
# re.DOTALL lets "." match newlines; re.M is not needed for this pattern
pat = re.compile(r'<tr>(.*?)</tr>', re.DOTALL)
print(pat.findall(data))

Or, without a regex:

for item in data.split("</tr>"):
    if "<tr>" in item:
        # keep only the text after the opening <tr> tag
        print(item[item.find("<tr>") + len("<tr>"):])
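The same non-regex idea can be wrapped in a function that returns the row bodies instead of printing them. This is only a sketch: the name `rows_between` and its parameters are hypothetical, and it assumes the tags appear exactly as `<tr>` and `</tr>`, with no attributes or different casing.

```python
def rows_between(data, start="<tr>", end="</tr>"):
    """Return the text between each start/end pair, scanning left to right."""
    rows = []
    pos = 0
    while True:
        i = data.find(start, pos)
        if i == -1:
            break  # no more opening tags
        j = data.find(end, i + len(start))
        if j == -1:
            break  # unclosed tag: stop rather than guess
        rows.append(data[i + len(start):j])
        pos = j + len(end)
    return rows


# Made-up sample input for illustration.
sample = "a<tr>x\ny</tr>b<tr>z</tr>"
result = rows_between(sample)
print(result)  # ['x\ny', 'z']
```

Because `str.find` treats newlines like any other character, multi-line rows need no special handling here.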

As others have suggested, the specific problem you are having can be resolved by letting . match newlines with re.DOTALL (re.S). Note that re.MULTILINE only changes how ^ and $ behave, so it does not help here.

However, you are going down a treacherous path by parsing HTML with regular expressions. Use an XML/HTML parser instead; BeautifulSoup works great for this!

doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

from bs4 import BeautifulSoup  # BeautifulSoup 4, from the beautifulsoup4 package
soup = BeautifulSoup(doc, "html.parser")
all_trs = soup.find_all("tr")
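If installing a third-party package is not an option, a similar result can be sketched with the standard library's `html.parser` module. The class name `RowCollector` and its attributes are hypothetical helpers written for this example, and the sketch assumes simple, non-nested tables like the one above:

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Hypothetical helper: collects the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text chunks of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

parser = RowCollector()
parser.feed(doc)
parser.close()
rows = parser.rows
print(rows)
# [['row 1, cell 1', 'row 1, cell 2'], ['row 2, cell 1', 'row 2, cell 2']]
```

The parser is event-driven, so the multi-line layout of the rows makes no difference.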
