Beautiful Soup findAll not counting all divs
from bs4 import BeautifulSoup
html = 'index.html'
soup = BeautifulSoup(open(html))
print(len(soup.findAll('div')))
where the file index.html is the source code of this shopping webpage.
My code shows that only 1 div tag was found. What's weirder is that findAll('a') returns a huge (so probably correct) number. span and other tags work too, just not div.
You are experiencing the differences between the parsers that BeautifulSoup uses under the hood.
Choose either html.parser or html5lib:
>>> from bs4 import BeautifulSoup
>>> html = 'index.html'
>>> soup = BeautifulSoup(open(html), 'html')
>>> len(soup.findAll('div'))
0
>>> soup = BeautifulSoup(open(html), 'lxml')
>>> len(soup.findAll('div'))
0
>>> soup = BeautifulSoup(open(html), 'html.parser')
>>> len(soup.findAll('div'))
774
>>> soup = BeautifulSoup(open(html), 'html5lib')
>>> len(soup.findAll('div'))
774
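The comparison above can also be scripted. The following is only a sketch: the helper name count_tags_by_parser and the sample markup are made up for illustration, and the counts you get for the real index.html will depend on the page and on which parsers are installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_tags_by_parser(markup, tag, parsers=("lxml", "html.parser", "html5lib")):
    """Return {parser_name: tag_count}; None marks a parser that is not installed."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            counts[parser] = None
            continue
        counts[parser] = len(soup.find_all(tag))
    return counts


# Hypothetical sample markup standing in for the contents of index.html.
sample = "<html><body><div><div>a</div></div><span>b</span></body></html>"
for name, count in count_tags_by_parser(sample, "div").items():
    print(name, count)
```

On well-formed markup like this sample the installed parsers should agree. A disagreement, as with the shopping page, suggests the markup is malformed and each parser is recovering from it differently.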
Note that if you don't specify a parser, BeautifulSoup picks one automatically:
If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.
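Because the automatic choice depends on what is installed, the same script can behave differently on two machines. As a rough sketch, you can check which parser was actually picked by inspecting the soup's tree builder. The helper name picked_parser_name is hypothetical, and it assumes the builder object exposes a NAME attribute, as the bundled builders do.

```python
import warnings

from bs4 import BeautifulSoup


def picked_parser_name(markup, parser=None):
    """Return the name of the tree builder Beautiful Soup ended up using."""
    with warnings.catch_warnings():
        # Recent bs4 releases warn when no parser is named; silence that here.
        warnings.simplefilter("ignore")
        soup = BeautifulSoup(markup, parser) if parser else BeautifulSoup(markup)
    return soup.builder.NAME


# With no parser named, the result depends on what is installed.
print(picked_parser_name("<div>hello</div>"))
# Naming the parser makes the result the same everywhere.
print(picked_parser_name("<div>hello</div>", "html.parser"))
```

Passing the parser name explicitly, as in the second call, keeps the results reproducible across environments.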