Matching multiple lines in a Python regular expression

Question

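I want to extract the data between the <tr> tags of an HTML page. I used the code below, but I am not getting any results. The HTML between the <tr> tags spans multiple lines.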

category =re.findall('<tr>(.*?)</tr>',data);

Please suggest how to fix this.

Answer 1

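Just to clear up the issue: despite all the references to re.M, it would not work here, as a quick skim of its documentation reveals. re.M only changes where ^ and $ match; what you need is re.S, which lets . match newlines as well. That is, of course, if you are not going to parse the HTML properly: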

>>> import re
>>> doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

>>> re.findall('<tr>(.*?)</tr>', doc, re.S)
['\n        <td>row 1, cell 1</td>\n        <td>row 1, cell 2</td>\n    ', 
 '\n        <td>row 2, cell 1</td>\n        <td>row 2, cell 2</td>\n    ']
>>> re.findall('<tr>(.*?)</tr>', doc, re.M)
[]
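
For anyone wondering what the two flags actually change, here is a minimal sketch. The sample string `text` is made up for illustration and is not from the question. It shows that re.M only affects the ^ and $ anchors, while re.S affects what . can match.

```python
import re

# Hypothetical sample text, not from the question.
text = "first line\nsecond line"

# Without flags, "." stops at a newline, so the match cannot span both lines.
no_flag = re.findall(r"first(.*)second", text)

# re.M (MULTILINE) only makes ^ and $ match at every line boundary;
# "." still refuses to cross the newline.
with_m = re.findall(r"first(.*)second", text, re.M)
line_starts = re.findall(r"^\w+", text, re.M)

# re.S (DOTALL) lets "." match the newline as well.
with_s = re.findall(r"first(.*)second", text, re.S)

print(no_flag)      # []
print(with_m)       # []
print(line_starts)  # ['first', 'second']
print(with_s)       # [' line\n']
```

So the two flags are independent: combine them only if the pattern also relies on ^ or $ matching per line.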

Answer 2

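Don't use regular expressions; use an HTML parser such as BeautifulSoup:

html = '<html><body>foo<tr>bar</tr>baz<tr>qux</tr></body></html>'

# BeautifulSoup 4 (the bs4 package); the old "import BeautifulSoup" is version 3, Python 2 only
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all("tr"))

Result:

[<tr>bar</tr>, <tr>qux</tr>]

If you only want the contents, without the tr tags:

for tr in soup.find_all("tr"):
    # get_text() returns the text inside the tag; tr.contents would give a list of child nodes
    print(tr.get_text())

Result:

bar
qux

Using an HTML parser is not as scary as it sounds! And it will work more reliably than any regular expression that will be posted here.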

Answer 3

Don't use regular expressions to parse HTML. Use an HTML parser such as lxml or BeautifulSoup.
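
If installing a third-party parser is not an option, the standard library's html.parser module can do a similar job. The following is only a sketch under that assumption, not what this answer recommends: the class name `RowCollector` is hypothetical, and the `class="odd"` attribute was added to the sample table to show that attributes do not get in the way.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attributes are ignored here.
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr class="odd">
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

collector = RowCollector()
collector.feed(doc)
collector.close()
print(collector.rows)
# [['row 1, cell 1', 'row 1, cell 2'], ['row 2, cell 1', 'row 2, cell 2']]
```

It is more code than a one-line regex, but newlines, attributes, and tag case no longer matter.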

Answer 4

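import re
# re.DOTALL lets "." match newlines; re.M is harmless but has no effect without ^ or $
pat = re.compile(r'<tr>(.*?)</tr>', re.DOTALL | re.M)
print(pat.findall(data))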

Or, without regular expressions:

for item in data.split("</tr>"):
    if "<tr>" in item:
        # keep everything after the opening tag
        print(item[item.find("<tr>") + len("<tr>"):])
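
The split-based loop above can be wrapped in a small function that returns the rows instead of printing them. This is just a sketch: the name `extract_rows` and the sample `data` string are made up for illustration, and like the loop it only recognises the exact strings `<tr>` and `</tr>`.

```python
def extract_rows(data, open_tag="<tr>", close_tag="</tr>"):
    """Return the text found between each open_tag/close_tag pair."""
    rows = []
    for chunk in data.split(close_tag):
        start = chunk.find(open_tag)
        if start != -1:
            # keep everything after the opening tag
            rows.append(chunk[start + len(open_tag):])
    return rows


# Hypothetical multi-line input, not from the question.
data = "<table>\n<tr>\n<td>a</td>\n</tr>\n<tr>\n<td>b</td>\n</tr>\n</table>"
rows = extract_rows(data)
print(rows)  # ['\n<td>a</td>\n', '\n<td>b</td>\n']
```

Because str.split and str.find do not care about newlines, this handles multi-line rows without any flags.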

Answer 5

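As others have said, the specific problem you are having can be resolved by letting . match across lines with re.DOTALL (re.S). Note that re.MULTILINE alone does not do this, since it only changes how ^ and $ behave.

However, you are going down a treacherous path by parsing HTML with regular expressions. Use an XML/HTML parser instead; BeautifulSoup works great for this!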

doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

# BeautifulSoup 4 lives in the bs4 package; name the parser explicitly
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, "html.parser")
all_trs = soup.find_all("tr")
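
To make the "treacherous" part concrete, here is a small sketch of how the literal <tr> pattern silently misses rows. The markup is a made-up variation of the table above: one row carries an attribute and another uses upper-case tags, both of which are valid HTML.

```python
import re

# Hypothetical variation of the table: the second row has an attribute
# and the third uses upper-case tags.
doc = """<table>
    <tr><td>row 1</td></tr>
    <tr class="odd"><td>row 2</td></tr>
    <TR><td>row 3</td></TR>
</table>"""

pattern = re.compile(r"<tr>(.*?)</tr>", re.S)
matches = pattern.findall(doc)

print(len(matches))  # 1
print(matches)       # ['<td>row 1</td>']
```

Two of the three rows are dropped without any error, which is the kind of failure a real parser avoids.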

5 answers

Answer 1: 17, accepted
Answer 2: 5, 2010-02-04 12:36:33
Answer 3: 2, 2010-02-04 12:24:20
Answer 4: 2, 2010-02-04 12:33:48
Answer 5: 0, 2010-02-04 12:45:54
