Matching multiple lines in a Python regular expression

Question

I want to extract the data between <tr> tags from an HTML page. I used the following code, but I didn't get any results. The HTML between the <tr> tags spans multiple lines.

category = re.findall('<tr>(.*?)</tr>', data)

Please suggest a fix for this problem.

Answer 1

Just to clear up the issue: despite all those links to re.M, it wouldn't work here, as even a quick skim of its documentation would reveal. re.M only changes where ^ and $ match. You'd need re.S, if you weren't trying to parse HTML, of course:

>>> doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

>>> re.findall('<tr>(.*?)</tr>', doc, re.S)
['\n        <td>row 1, cell 1</td>\n        <td>row 1, cell 2</td>\n    ', 
 '\n        <td>row 2, cell 1</td>\n        <td>row 2, cell 2</td>\n    ']
>>> re.findall('<tr>(.*?)</tr>', doc, re.M)
[]
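
The difference between the two flags can be seen in isolation. By default, . matches any character except a newline; re.S (alias re.DOTALL) lifts that restriction, while re.M (alias re.MULTILINE) only makes ^ and $ match at every line boundary. A minimal sketch, written in Python 3 syntax and using a made-up three-line sample string (not the asker's data):

```python
import re

sample = "<tr>\n<td>a</td>\n</tr>"  # hypothetical input spanning three lines

# Default: '.' stops at newlines, so the pattern cannot bridge the lines.
no_flag = re.findall(r'<tr>(.*?)</tr>', sample)

# re.M only changes '^' and '$' so that they match at every line boundary.
with_m = re.findall(r'<tr>(.*?)</tr>', sample, re.M)
line_starts = re.findall(r'^<', sample, re.M)

# re.S (alias re.DOTALL) lets '.' match newlines as well.
with_s = re.findall(r'<tr>(.*?)</tr>', sample, re.S)

print(no_flag, with_m, line_starts, with_s)
```

Only the re.S call returns the row contents; the re.M call behaves exactly like the call with no flag, because the pattern contains neither ^ nor $.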

Answer 2

Don't use regex; use an HTML parser such as BeautifulSoup:

html = '<html><body>foo<tr>bar</tr>baz<tr>qux</tr></body></html>'

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
print soup.findAll("tr")

Result:

[<tr>bar</tr>, <tr>qux</tr>]

If you just want the contents, without the tr tags:

for tr in soup.findAll("tr"):
    # .string is the text inside the tag; .contents would give a list of child nodes
    print tr.string

Result:

bar
qux

Using an HTML parser isn't as scary as it sounds! And it will work more reliably than any regex that will be posted here.

Answer 3

Do not use regular expressions to parse HTML. Use an HTML parser such as lxml or BeautifulSoup.
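
If installing a third-party parser is not an option, the same idea can be sketched with the standard library's html.parser module. This is only an illustration in Python 3 syntax: the RowCollector class and its attributes are made-up names, the sample table is a shortened copy of the one used elsewhere in this thread, and the sketch assumes flat tables with no nesting.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per <tr>
        self.in_cell = False  # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current cell.
        if self.in_cell:
            self.rows[-1][-1] += data


doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
</table>"""

collector = RowCollector()
collector.feed(doc)
collector.close()
print(collector.rows)
```

Unlike the regex, this keeps working when a tag carries attributes, such as <tr class="odd">, because the parser reports the tag name separately from its attributes.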

Answer 4

import re

# re.DOTALL lets '.' match newlines; re.M is not needed for this pattern
pat = re.compile('<tr>(.*?)</tr>', re.DOTALL)
print pat.findall(data)

Or, a non-regex way:

for item in data.split("</tr>"):
    if "<tr>" in item:
        print item[item.find("<tr>") + len("<tr>"):]
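
The split-based loop above can also be wrapped in a function that returns the rows instead of printing them. A minimal sketch in Python 3 syntax; rows_by_split is a hypothetical helper name and the input string is made up for illustration:

```python
def rows_by_split(data):
    """Return the text between each <tr> and the following </tr>."""
    rows = []
    for chunk in data.split("</tr>"):
        # partition() returns an empty separator when "<tr>" is absent.
        _, found, inner = chunk.partition("<tr>")
        if found:
            rows.append(inner)
    return rows


result = rows_by_split("<tr>a\nb</tr>x<tr>c</tr>")
print(result)
```

Like the regex, this only recognises the literal string <tr>, so an opening tag with attributes or different capitalisation is missed.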

Answer 5

As others have suggested, the specific problem you are having can be resolved by letting . match newlines with re.DOTALL (re.MULTILINE alone will not do it, since it only affects ^ and $).

However, you are going down a treacherous path parsing HTML with regular expressions. Use an XML/HTML parser instead; BeautifulSoup works great for this!

doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(doc)
all_trs = soup.findAll("tr")
