
How to parse a large, malformed HTML page in Python?

I'm trying to parse a large HTML page with malformed table markup. There are around 7000-10000 rows in the table. The problem is that none of the tr, th, or td tags are ever closed, so the markup looks like this:

<HTML>
<HEAD>
</HEAD>
<BODY>

<center>

    <table border = 1>
        <tr height=40><th colspan = 16><font size=4>Dummy content
        <tr><th>A
            <th>B
            <th>C
            <th>D
            <th>E
            <th>F
            <th>G


        <tr><td>A
            <td>B
            <td>C
            <td>D
            <td>E
        <tr><td>A
            <td>B
            <td>C
            <td>D
            <td>E
    .........
    .........

    </table>
    </center>
    </BODY>
    </HTML>

I tried BeautifulSoup.prettify() to fix it, but BeautifulSoup runs into a maximum recursion depth error. I also tried lxml, as follows:

from lxml import html
root = html.fromstring(htmltext)
print(len(root.findall('.//tr')))

But it returns a length of around 50, whereas there are actually over 7000 tr elements.

Is there a good way to parse the HTML and extract content for each row?
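For reference, lxml's HTML parser (libxml2) can usually recover unclosed table markup on its own; a minimal sketch on a small table in the same broken style, showing the difference between find() (first match only) and findall() (all matches):

```python
from lxml import html

# A small malformed table in the same style as the page above:
# no tr/th/td is ever closed.
snippet = """
<table border=1>
<tr><th>A<th>B
<tr><td>1<td>2
<tr><td>3<td>4
</table>
"""

root = html.fromstring(snippet)

# find() returns only the FIRST matching element; findall() returns all of them.
rows = root.findall('.//tr')
print(len(rows))  # 3 rows, despite the missing closing tags

for row in rows:
    # text_content() gathers the text of the auto-closed cells
    print([cell.text_content().strip() for cell in row])
```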

I think you are looking for something like this:

import re
# Capture everything between two consecutive <tr> tags.
trs = re.findall(r'(?<=<tr>).*?(?=<tr>)', your_string, re.DOTALL)
print(trs)

This regex returns everything between two tr tags. If you want to search between two other tags, just change both occurrences of tr to the tag you need.

I ran a little test and it worked for me; let me know if it helped you.
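As written, that pattern would skip rows with attributes (such as the <tr height=40> in the question) and lose the final row, since no <tr> follows it. A sketch of an adjusted variant that tolerates attributes and also stops at </table>, then pulls the cells out of each row with a second pattern:

```python
import re

page = """
<table border=1>
<tr height=40><th colspan=16>Dummy content
<tr><th>A
    <th>B
<tr><td>1
    <td>2
<tr><td>3
    <td>4
</table>
"""

# [^>]* tolerates attributes such as height=40, and the lookahead also
# stops at </table> so the last row is not lost.
rows = re.findall(r'<tr[^>]*>(.*?)(?=<tr|</table)', page, re.DOTALL)
print(len(rows))  # 4 rows

for row in rows:
    # Each cell's text runs from its <th>/<td> tag to the next '<'.
    cells = re.findall(r'<t[hd][^>]*>([^<]*)', row)
    print([c.strip() for c in cells])
```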

I'd suggest trying the HTMLParser module (html.parser in Python 3). I just wrote some code that uses it, and I couldn't test my "except HTMLParser.HTMLParseError" block because I couldn't devise input that would make the parser fail!
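That parser reports each start tag as it encounters it, which is enough to rebuild the rows even though nothing is ever closed: every <tr> starts a new row and every <th>/<td> starts a new cell. A minimal sketch (the RowCollector class name is just illustrative), assuming the cells contain only plain text:

```python
from html.parser import HTMLParser  # the Python 2 module was called HTMLParser


class RowCollector(HTMLParser):
    """Collects table rows from markup where tr/th/td are never closed."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            # A new <tr> implicitly ends the previous row.
            self.rows.append([])
            self.in_cell = False
        elif tag in ('th', 'td') and self.rows:
            # A new <th>/<td> implicitly ends the previous cell.
            self.rows[-1].append('')
            self.in_cell = True

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.rows[-1][-1] += data.strip()


parser = RowCollector()
parser.feed("""
<table border=1>
<tr height=40><th colspan=16>Dummy content
<tr><th>A
    <th>B
<tr><td>1
    <td>2
</table>
""")
print(parser.rows)  # [['Dummy content'], ['A', 'B'], ['1', '2']]
```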
