
Beautiful Soup findAll not counting all divs

from bs4 import BeautifulSoup

html = 'index.html'
soup = BeautifulSoup(open(html))
print(len(soup.findAll('div')))

where the file index.html is the source code of this shopping webpage.

My code shows that only 1 div tag was found. What's weirder is that findAll('a') returns a huge (so probably correct) number. span works too, and so do other tags, but not div.

You are experiencing the differences between the parsers that BeautifulSoup uses under the hood.

Choose either html.parser or html5lib:

>>> from bs4 import BeautifulSoup
>>> html = 'index.html'
>>> soup = BeautifulSoup(open(html), 'html')
>>> len(soup.findAll('div'))
0
>>> soup = BeautifulSoup(open(html), 'lxml')
>>> len(soup.findAll('div'))
0
>>> soup = BeautifulSoup(open(html), 'html.parser')
>>> len(soup.findAll('div'))
774
>>> soup = BeautifulSoup(open(html), 'html5lib')
>>> len(soup.findAll('div'))
774
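
If you want to repeat this comparison on your own markup, here is a minimal sketch. The helper name count_divs_by_parser and the sample string are made up for illustration, and lxml and html5lib are optional packages that may not be installed, so the helper reports None for a parser it cannot load instead of failing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_divs_by_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <div> tags, or None if unavailable}."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Raised when the requested parser is not installed.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("div"))
    return counts


# Hypothetical, well-formed sample; real pages are where the parsers disagree.
sample = "<html><body><div><div>nested</div></div><span>x</span></body></html>"
for name, count in count_divs_by_parser(sample).items():
    print(name, count)
```

On a small, well-formed string like this one the installed parsers should agree. The differences show up on large or malformed pages, so pass the contents of your own index.html to see them.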

Note that if you don't specify a parser, BeautifulSoup picks one automatically. From the documentation:

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.
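
Because of that automatic choice, the same script can give different counts on machines with different parsers installed. One way to avoid the surprise is to always name the parser. The sketch below assumes the file is UTF-8 encoded. count_tags is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def count_tags(path, tag, parser="html.parser"):
    """Parse the file at `path` with an explicit parser and count `tag` elements."""
    # Assumption: the saved page is UTF-8; change the encoding if yours differs.
    with open(path, encoding="utf-8") as handle:
        soup = BeautifulSoup(handle, parser)
    return len(soup.find_all(tag))
```

For the file in the question you would call count_tags('index.html', 'div'), or pass parser='html5lib' if that package is installed.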

