Python and lxml.html get_element_by_id output questions
I'm currently trying to get data from an HTML file. The code I'm using works, but not the way I expect: I can get some items but not all of them, and I'm wondering whether that has to do with the size of the file I'm reading.
I'm currently trying to parse the source of this webpage.
The page is about 4500 lines long, so it is fairly large. I've been using it because I'd like to make sure the code works on large files.
The code I'm using is:
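import urllib2  # Python 2 only; on Python 3 this is urllib.request

import lxml.html

# Download the page and parse it into an lxml element tree
webHTML = urllib2.urlopen('http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html').read()
webHTML = lxml.html.fromstring(webHTML)
# Raises KeyError if no element has this id
productDetails = webHTML.get_element_by_id('productDetails')
# Iterating over an element yields its child elements only
for element in productDetails:
    print element.text_content()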
This gives the expected output when I use an element id of 'mm3' or something else near the top of the page, but if I use 'productDetails' I get no output, at least on my current setup.
I'm afraid lxml.html cannot parse this particular HTML source correctly. It parses the h3 tag with id="productDetails" as an empty element (and this is in the default "recover" mode):
<h3 class="productDescription2" id="productDetails" itemprop="description"></h3>
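One way to check this kind of result yourself is to ask lxml what it actually holds for an id: how many child elements it has, what text it contains, and how it serialises. The sketch below is written for Python 3 and runs on a small made-up snippet, not the real page source; SAMPLE and the helper name describe are assumptions for illustration only. It also separates two cases that look alike from the outside: an id that does not exist (get_element_by_id raises KeyError unless a default is passed) and an id that exists but has no children, where a for loop over the element simply prints nothing.

```python
import lxml.html

# Hypothetical stand-in markup, NOT the real page source.
SAMPLE = (
    '<div>'
    '<ul id="mm3"><li>Home</li><li>Shop</li></ul>'
    '<h3 id="productDetails"></h3>'
    '</div>'
)


def describe(root, element_id):
    """Summarise what lxml holds for the element with the given id."""
    # A second positional argument is returned instead of raising KeyError.
    element = root.get_element_by_id(element_id, None)
    if element is None:
        return None
    return {
        # Number of child elements, i.e. what a for loop would iterate over.
        "children": len(element),
        # All text inside the element, including text not wrapped in a child.
        "text": element.text_content(),
        # The element as lxml would write it back out.
        "markup": lxml.html.tostring(element, encoding="unicode"),
    }


root = lxml.html.fromstring(SAMPLE)
for element_id in ("mm3", "productDetails", "missing"):
    print(element_id, describe(root, element_id))
```

If the children count is 0 and the text is empty, the parser really did produce an empty element, and the content has to be looked for elsewhere in the tree or obtained with a different parser.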
Switch to BeautifulSoup with the html5lib parser (it is extremely lenient):
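from urllib2 import urlopen  # Python 2; use urllib.request on Python 3
from bs4 import BeautifulSoup

url = 'http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html'
# html5lib builds the tree the way a browser would, so broken markup is tolerated
soup = BeautifulSoup(urlopen(url), 'html5lib')
# find_all() with no arguments returns every descendant tag
for element in soup.find(id='productDetails').find_all():
    print element.text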
Prints:
Looking for the ultimate power system for your next Multi-rotor project? Look no further!The Turnigy Multistar outrunners are designed with one thing in mind - maximising Multi-rotor performance! They feature high-end magnets, high quality bearings and all are precision balanced for smooth running, these motors are engineered specifically for multi-rotor use.These include a prop adapter and have a built in aluminium mount for quick and easy installation on your multi-rotor frame.
outrunner
...
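For readers on Python 3, where urllib2 became urllib.request and print is a function, the same approach might look like the sketch below. The helper names texts_under_id and fetch_texts and the SAMPLE snippet are assumptions made for illustration, not part of the original answer. The offline demonstration uses Python's built-in "html.parser" so that it runs without html5lib installed; for the real page the answer's 'html5lib' parser is the one to use.

```python
from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen

from bs4 import BeautifulSoup


def texts_under_id(markup, element_id, parser="html5lib"):
    """Return the text of every descendant tag of the element with this id."""
    soup = BeautifulSoup(markup, parser)
    target = soup.find(id=element_id)
    if target is None:
        # find() returns None when nothing matches; report that explicitly.
        raise LookupError("no element with id=%r" % element_id)
    # find_all() with no arguments walks all descendant tags in document order.
    return [tag.get_text() for tag in target.find_all()]


def fetch_texts(url, element_id):
    """Download a page and extract the texts (needs network access and html5lib)."""
    with urlopen(url) as response:
        return texts_under_id(response.read(), element_id)


# Hypothetical snippet for an offline check, NOT the real page source.
SAMPLE = '<div id="details"><p>First.</p><p>Second <b>bold</b> one.</p></div>'

print(texts_under_id(SAMPLE, "details", parser="html.parser"))
```

Because find_all() returns nested tags as well as their parents, text inside a nested tag is reported twice: once as part of the parent's text and once on its own. That would account for a short word such as "outrunner" appearing on its own line after the paragraph that contains it. If only the combined text is wanted, calling get_text() once on the element found by id avoids the repetition.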