
How to parse a large, malformed HTML page in Python?

I'm trying to parse a large HTML page with malformed table markup. There are around 7000-10000 rows in the table. The problem is that none of the tr, th, or td tags are ever closed, so the markup looks like this:

<HTML>
<HEAD>
</HEAD>
<BODY>

<center>

    <table border = 1>
        <tr height=40><th colspan = 16><font size=4>Dummy content
        <tr><th>A
            <th>B
            <th>C
            <th>D
            <th>E
            <th>F
            <th>G


        <tr><td>A
            <td>B
            <td>C
            <td>D
            <td>E
        <tr><td>A
            <td>B
            <td>C
            <td>D
            <td>E
    .........
    .........

    </table>
    </center>
    </BODY>
    </HTML>

I tried BeautifulSoup.prettify() to fix it, but BeautifulSoup runs into a maximum recursion depth error. I also tried lxml, as follows:

from lxml import html
root = html.fromstring(htmltext)
print(len(root.findall('.//tr')))

But it returns a length of around 50, whereas there are actually over 7000 tr elements.

Is there a good way to parse the HTML and extract content for each row?
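For reference, lxml's HTML parser (libxml2) can usually recover unclosed table markup on its own; a minimal sketch on a small table in the same broken style, showing the difference between find() (first match only) and findall() (all matches):

```python
from lxml import html

# A small malformed table in the same style as the page above:
# no tr/th/td is ever closed.
snippet = """
<table border=1>
<tr><th>A<th>B
<tr><td>1<td>2
<tr><td>3<td>4
</table>
"""

root = html.fromstring(snippet)

# find() returns only the FIRST matching element; findall() returns all of them.
rows = root.findall('.//tr')
print(len(rows))  # 3 rows, despite the missing closing tags

for row in rows:
    # text_content() gathers the text of the auto-closed cells
    print([cell.text_content().strip() for cell in row])
```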

I think you are looking for something like this:

import re
# Capture everything between two consecutive <tr> tags.
trs = re.findall(r'(?<=<tr>).*?(?=<tr>)', your_string, re.DOTALL)
print(trs)

This regex returns everything between two tr tags. If you want to search between two other tags, just change both occurrences of tr to the tag you need.

I ran a little test and it worked for me; let me know if it helped you.
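As written, that pattern would skip rows with attributes (such as the <tr height=40> in the question) and lose the final row, since no <tr> follows it. A sketch of an adjusted variant that tolerates attributes and also stops at </table>, then pulls the cells out of each row with a second pattern:

```python
import re

page = """
<table border=1>
<tr height=40><th colspan=16>Dummy content
<tr><th>A
    <th>B
<tr><td>1
    <td>2
<tr><td>3
    <td>4
</table>
"""

# [^>]* tolerates attributes such as height=40, and the lookahead also
# stops at </table> so the last row is not lost.
rows = re.findall(r'<tr[^>]*>(.*?)(?=<tr|</table)', page, re.DOTALL)
print(len(rows))  # 4 rows

for row in rows:
    # Each cell's text runs from its <th>/<td> tag to the next '<'.
    cells = re.findall(r'<t[hd][^>]*>([^<]*)', row)
    print([c.strip() for c in cells])
```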

I'd suggest trying the HTMLParser module (html.parser in Python 3). I just wrote some code that uses it, and I couldn't test my "except HTMLParser.HTMLParseError" block because I couldn't devise input that would make the parser fail!
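That parser reports each start tag as it encounters it, which is enough to rebuild the rows even though nothing is ever closed: every <tr> starts a new row and every <th>/<td> starts a new cell. A minimal sketch (the RowCollector class name is just illustrative), assuming the cells contain only plain text:

```python
from html.parser import HTMLParser  # the Python 2 module was called HTMLParser


class RowCollector(HTMLParser):
    """Collects table rows from markup where tr/th/td are never closed."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            # A new <tr> implicitly ends the previous row.
            self.rows.append([])
            self.in_cell = False
        elif tag in ('th', 'td') and self.rows:
            # A new <th>/<td> implicitly ends the previous cell.
            self.rows[-1].append('')
            self.in_cell = True

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.rows[-1][-1] += data.strip()


parser = RowCollector()
parser.feed("""
<table border=1>
<tr height=40><th colspan=16>Dummy content
<tr><th>A
    <th>B
<tr><td>1
    <td>2
</table>
""")
print(parser.rows)  # [['Dummy content'], ['A', 'B'], ['1', '2']]
```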
