Matching multiple lines in a Python regular expression

I want to extract the data between <tr> tags from an HTML page. I used the following code, but I didn't get any results. The HTML between the <tr> tags spans multiple lines.

category = re.findall('<tr>(.*?)</tr>', data)

Please suggest a fix for this problem.

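Just to clear up the issue: despite all those links to re.M, it wouldn't work here, as a quick skim of its documentation would reveal. You'd need re.S, which makes . match newlines as well; re.M only changes how ^ and $ behave. (That is, if you weren't trying to parse HTML with a regex in the first place, of course.)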

>>> import re
>>> doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

>>> re.findall('<tr>(.*?)</tr>', doc, re.S)
['\n        <td>row 1, cell 1</td>\n        <td>row 1, cell 2</td>\n    ', 
 '\n        <td>row 2, cell 1</td>\n        <td>row 2, cell 2</td>\n    ']
>>> re.findall('<tr>(.*?)</tr>', doc, re.M)
[]
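To make the difference between the two flags concrete, here is a minimal sketch written for Python 3. The `text` and `html` strings are made up for illustration. It shows that re.M only changes where ^ and $ match, that re.S changes what . matches, and that the inline flag (?s) behaves like re.S.

```python
import re

text = "first\nsecond"
html = "<tr>\n<td>a</td>\n</tr>"

# re.M (MULTILINE): ^ and $ also match at the start and end of each line.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.M)

# re.S (DOTALL): "." also matches the newline character.
rows_default = re.findall(r"<tr>(.*?)</tr>", html)
rows_dotall = re.findall(r"<tr>(.*?)</tr>", html, re.S)

# The inline flag (?s) at the start of the pattern is equivalent to re.S.
rows_inline = re.findall(r"(?s)<tr>(.*?)</tr>", html)

print(starts_default)              # ['first']
print(starts_multiline)            # ['first', 'second']
print(rows_default)                # []
print(rows_dotall)                 # ['\n<td>a</td>\n']
print(rows_inline == rows_dotall)  # True
```
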

Don't use regex; use an HTML parser such as BeautifulSoup:

html = '<html><body>foo<tr>bar</tr>baz<tr>qux</tr></body></html>'

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
print soup.findAll("tr")

Result:

[<tr>bar</tr>, <tr>qux</tr>]

If you just want the contents, without the tr tags:

for tr in soup.findAll("tr"):
    print tr.contents

Result:

[u'bar']
[u'qux']

Using an HTML parser isn't as scary as it sounds! And it will work more reliably than any regex that will be posted here.
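If installing a third-party package is not an option, the same idea can be sketched with the standard library. The block below is a rough Python 3 illustration that uses html.parser instead of BeautifulSoup, so it is a swapped-in technique rather than the approach shown above. The class name `RowCollector` and its attributes are hypothetical, and the sample table is the one used elsewhere in this thread.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

collector = RowCollector()
collector.feed(doc)
collector.close()
print(collector.rows)
# [['row 1, cell 1', 'row 1, cell 2'], ['row 2, cell 1', 'row 2, cell 2']]
```

Unlike the regex, this also handles rows written as <tr class="..."> because the parser reports the tag name separately from its attributes.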

Do not use regular expressions to parse HTML. Use an HTML parser such as lxml or BeautifulSoup. That said, a regex that works for this case:

# re.DOTALL makes "." match newlines; re.M is unnecessary because the pattern has no ^ or $
pat = re.compile('<tr>(.*?)</tr>', re.DOTALL)
print pat.findall(data)

Or, without regex:

for item in data.split("</tr>"):
    if "<tr>" in item:
        print item[item.find("<tr>") + len("<tr>"):]
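The same splitting idea can be wrapped in a reusable function. This is a minimal Python 3 sketch; the function name `rows_by_splitting` and the `sample` string are invented for illustration. It uses str.partition in place of find plus slicing.

```python
def rows_by_splitting(data):
    """Return the text between each <tr> and the next </tr> using only str methods."""
    rows = []
    for chunk in data.split("</tr>"):
        # partition returns (before, separator, after); separator is "" if "<tr>" is absent
        _, found, inner = chunk.partition("<tr>")
        if found:
            rows.append(inner)
    return rows


sample = "<table>\n<tr>\n<td>a</td>\n</tr>\n<tr><td>b</td></tr>\n</table>"
print(rows_by_splitting(sample))
# ['\n<td>a</td>\n', '<td>b</td>']
```

Like the loop above, it only recognizes a literal <tr> with no attributes and is case-sensitive, so it is best kept for markup you control.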

As others have suggested, the specific problem you are having can be resolved by letting . match newlines with re.DOTALL (re.S). Note that re.MULTILINE does not help here, since it only affects ^ and $.

However, you are going down a treacherous path by parsing HTML with regular expressions. Use an XML/HTML parser instead; BeautifulSoup works great for this!

doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(doc)
all_trs = soup.findAll("tr")


 