I am trying to parse a large html document using the Python Beautiful Soup 4 library.
The page contains a very large table, structured like so:
<table summary='foo'>
<tbody>
<tr>
A bunch of data
</tr>
<tr>
More data
</tr>
.
.
.
100s of <tr> tags later
</tbody>
</table>
I have a function that evaluates whether a given tag in soup.descendants is of the kind I am looking for. This is necessary because the page is large (Beautiful Soup tells me the document contains around 4000 tags). The function looks like this:
def isrow(tag):
    if tag.name == u'tr':
        if tag.parent.parent.name == u'table' and \
           tag.parent.parent.has_attr('summary'):
            return True
My problem is that when I iterate through soup.descendants, the function only returns True for the first 77 rows of the table, when I know that the <tr> tags continue for hundreds more rows.
Is this a problem with my function or is there something I don't understand about how BeautifulSoup generates its collection of descendants? I suspect it might be a Python or a bs4 memory issue but I don't know how to go about troubleshooting it.
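One way to narrow this down is to count the <tr> start tags with a parser that is independent of Beautiful Soup. The sketch below uses only the standard library's event-based html.parser; the small inline sample stands in for your page, so the expected count of 2 is specific to this example:

```python
from html.parser import HTMLParser

# Count <tr> start tags with the stdlib's event-based parser,
# to cross-check the number of rows Beautiful Soup reports.
class RowCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows += 1

html = """<table summary='foo'><tbody>
<tr><td>A bunch of data</td></tr>
<tr><td>More data</td></tr>
</tbody></table>"""

counter = RowCounter()
counter.feed(html)
print(counter.rows)  # 2 for this sample
```

If this raw count matches the full number of rows in the document but Beautiful Soup only yields 77, the rows are being dropped during parsing rather than by your function.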
This is still more of an educated guess, but I'll give it a try.
The way BeautifulSoup parses HTML depends heavily on the underlying parser. If you don't specify one explicitly, BeautifulSoup will choose one automatically based on an internal ranking:
If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.
In your case, I'd try switching parsers and see what results you get:
soup = BeautifulSoup(data, "lxml") # needs lxml to be installed
soup = BeautifulSoup(data, "html5lib") # needs html5lib to be installed
soup = BeautifulSoup(data, "html.parser") # uses built-in html.parser
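To compare them side by side, you can count the rows each installed parser produces from the same input. This is a sketch: the inline sample stands in for your page's data, and parsers that aren't installed are simply skipped:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Stand-in for the real page content
data = """<table summary='foo'><tbody>
<tr><td>A bunch of data</td></tr>
<tr><td>More data</td></tr>
</tbody></table>"""

counts = {}
for parser in ('lxml', 'html5lib', 'html.parser'):
    try:
        soup = BeautifulSoup(data, parser)
    except FeatureNotFound:
        continue  # this parser isn't installed; skip it
    counts[parser] = len(soup.find_all('tr'))

print(counts)  # e.g. {'html.parser': 2, ...}
```

If one parser reports far fewer rows than the others on your real document, that parser is silently dropping or restructuring part of the malformed table, and switching to a more lenient one (html5lib is usually the most forgiving) should recover the missing rows.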