Beautiful Soup filter function fails to find all rows of a table

I am trying to parse a large HTML document using the Python Beautiful Soup 4 library.

The page contains a very large table, structured like so:

<table summary='foo'>
    <tbody>
        <tr> 
            A bunch of data 
        </tr>
        <tr>
            More data 
        </tr>
        .
        .
        .
        100s of <tr> tags later
    </tbody>
</table>

I have a function that evaluates whether a given tag in soup.descendants is of the kind I am looking for. This is necessary because the page is large (BeautifulSoup tells me the document contains around 4000 tags). It is like so:

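def isrow(tag):
    # True only for a <tr> whose grandparent is a <table> with a 'summary' attribute.
    if tag.name != u'tr' or tag.parent is None or tag.parent.parent is None:
        return False
    table = tag.parent.parent
    return table.name == u'table' and table.has_attr('summary')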

My problem is that when I iterate through soup.descendants, the function returns True for only the first 77 rows in the table, even though I know that the <tr> tags continue for hundreds of rows.

Is this a problem with my function or is there something I don't understand about how BeautifulSoup generates its collection of descendants? I suspect it might be a Python or a bs4 memory issue but I don't know how to go about troubleshooting it.

This is more of an educated guess, but I'll give it a try.

The way BeautifulSoup parses HTML depends heavily on the underlying parser. If you don't specify one explicitly, BeautifulSoup will choose one automatically based on an internal ranking:

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.

In your case, I'd try switching parsers and see what results you get:

from bs4 import BeautifulSoup

# Use one of these at a time; `data` is the HTML string you already loaded.
soup = BeautifulSoup(data, "lxml")  # needs lxml to be installed
soup = BeautifulSoup(data, "html5lib")  # needs html5lib to be installed
soup = BeautifulSoup(data, "html.parser")  # uses built-in html.parser
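
To make the comparison concrete, here is a minimal sketch that counts the matching rows under each parser. The `count_rows` helper and the generated `sample` markup are hypothetical stand-ins for your real document, and parsers that are not installed are simply reported as `None`. If the counts differ on your actual page, the parser's handling of the markup is the likely culprit rather than the filter function.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def isrow(tag):
    # True only for a <tr> whose grandparent is a <table> with a 'summary' attribute.
    if tag.name != 'tr' or tag.parent is None or tag.parent.parent is None:
        return False
    table = tag.parent.parent
    return table.name == 'table' and table.has_attr('summary')


def count_rows(html, parser):
    """Count rows accepted by isrow, or return None if the parser is not installed."""
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        return None
    # find_all accepts a function and calls it on every Tag in the tree.
    return len(soup.find_all(isrow))


# Hypothetical stand-in for the real page: one table with 300 rows.
sample = (
    "<table summary='foo'><tbody>"
    + "<tr><td>data</td></tr>" * 300
    + "</tbody></table>"
)

counts = {p: count_rows(sample, p) for p in ("lxml", "html5lib", "html.parser")}
for parser, n in counts.items():
    print(parser, "not installed" if n is None else n)
```

On well-formed markup like this sample every installed parser should agree. Running the same loop on your real `data` shows whether one of them stops early.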
