
BeautifulSoup / lxml: Are there problems with large elements?

import os, re, sys, urllib2
from bs4 import BeautifulSoup
import lxml

html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
soup = BeautifulSoup(html, "lxml")
divs = soup.find_all("div", {"class":"block"})
print len(divs)

Output:

ActivePython 2.7.2.5 (ActiveState Software Inc.) based on
Python 2.7.2 (default, Jun 24 2011, 12:21:10) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, re, sys, urllib2
>>> from bs4 import BeautifulSoup
>>> import lxml
>>>
>>> html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
>>> soup = BeautifulSoup(html, "lxml")
>>> divs = soup.find_all("div", {"class":"block"})
>>> print len(divs)
2

I also tried:

divs = soup.find_all(class_="block")

with the same result.

But there are 11 elements that match this condition. So are there any limitations, such as a maximum element size? And how can I get all the elements?

The easiest way is probably to use 'html.parser' instead of 'lxml':

import os, re, sys, urllib2
from bs4 import BeautifulSoup
import lxml

html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", {"class":"block"})
print len(divs)

With your original code (using lxml) it printed 1 for me, but this prints 11. lxml is lenient, but not as lenient as html.parser for this page.

Please note that the page produces over one thousand warnings if you run it through tidy, including invalid character codes, unclosed <div>s, and characters like < and / at positions where they are not allowed.
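
One way to cross-check how many matching elements the raw markup actually contains, independent of how a tree builder repairs the broken nesting, is to count start tags directly with an event-based parser. The following is only a minimal sketch under some assumptions: it targets Python 3 (where the standard-library module is html.parser, whereas the code in this thread is Python 2), and the names BlockDivCounter and count_block_divs as well as the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class BlockDivCounter(HTMLParser):
    """Count <div> start tags whose class attribute contains "block"."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # attrs is a list of (name, value) pairs; a value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "block" in classes:
            self.count += 1


def count_block_divs(markup):
    """Return the number of <div class="... block ..."> start tags in markup."""
    parser = BlockDivCounter()
    parser.feed(markup)
    parser.close()
    return parser.count


# Deliberately broken sample markup (hypothetical, not taken from the page):
# the first div is never closed and there is a stray "<" in the text.
sample = (
    '<div class="block"><p>one < two'
    '<div class="block wide">two</div>'
    '<div class="other">three</div>'
    '<div class="block">four</div>'
)
print(count_block_divs(sample))
```

On the real page you would pass the downloaded document, decoded to text, to count_block_divs and compare the result with len(divs). If the two numbers differ, that would suggest the difference comes from the tree builder's error recovery rather than from any limit on element size.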
