Python and lxml.html get_element_by_id output problem

Question

I'm currently trying to get data from an HTML file. The code I'm using appears to work, but not as I expect: I can get some items but not all, and I'm wondering whether it has to do with the size of the file I'm attempting to read.

I'm currently trying to parse the source of this webpage.

This page is 4500 lines long, so it is a pretty good size. I've been using this page because I'd like to make sure the code works on large files.

The code I'm using is:

import lxml.html
import lxml
import urllib2

webHTML = urllib2.urlopen('http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html').read()
webHTML = lxml.html.fromstring(webHTML)
productDetails = webHTML.get_element_by_id('productDetails')
for element in productDetails:
    print element.text_content()

This gives the expected output when I use an element id of 'mm3' or something else near the top of the page, but if I use the element id 'productDetails' I get no output. At least that is what happens on my current setup.

Answer 1

I'm afraid lxml.html cannot handle parsing this particular HTML source. It parses the h3 tag with id="productDetails" as an empty element (and this is in the default "recover" mode):

<h3 class="productDescription2" id="productDetails" itemprop="description"></h3>
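To check what lxml actually built for an element, it can help to serialize it and count its children. The following is a minimal Python 3 sketch, assuming lxml is installed; the inline SAMPLE markup and the describe helper are made up for illustration and are not the HobbyKing page. Note that iterating over an lxml element yields only its child elements, so an element without children prints nothing in a loop like the one in the question.

```python
import lxml.html

# Hypothetical stand-in markup, not the real page source.
SAMPLE = """
<html><body>
  <div id="mm3"><span>menu</span></div>
  <h3 id="productDetails"><span>Looking for a motor?</span><b>outrunner</b></h3>
</body></html>
"""


def describe(doc, element_id):
    """Summarize how lxml parsed the element with the given id."""
    el = doc.get_element_by_id(element_id)
    return {
        # len() of an element is the number of direct child elements
        "children": len(el),
        # text_content() joins the text of the element and all descendants
        "text": el.text_content().strip(),
        # serialized markup as lxml sees it after parsing
        "markup": lxml.html.tostring(el, encoding="unicode"),
    }


doc = lxml.html.fromstring(SAMPLE)
info = describe(doc, "productDetails")
print(info["children"], repr(info["text"]))
print(info["markup"])
```

If children is 0 and the markup comes back as an empty tag for the real page, the content was lost during parsing rather than in the loop.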

Switch to BeautifulSoup with the html5lib parser (it is extremely lenient):

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html'
soup = BeautifulSoup(urlopen(url), 'html5lib')

for element in soup.find(id='productDetails').find_all():
    print element.text

Prints:

Looking for the ultimate power system for your next Multi-rotor project? Look no further!The Turnigy Multistar outrunners are designed with one thing in mind - maximising Multi-rotor performance! They feature high-end magnets, high quality bearings and all are precision balanced for smooth running, these motors are engineered specifically for multi-rotor use.These include a prop adapter and have a built in aluminium mount for quick and easy installation on your multi-rotor frame.

outrunner

...
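The snippet above is Python 2 (urllib2 and the print statement). Below is a rough Python 3 sketch of the same lookup, run against an inline sample rather than the live page. SAMPLE and texts_by_id are illustrative names, and the fallback to the built-in html.parser when html5lib is not installed is my own assumption, not part of the original answer.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in markup, not the real page source.
SAMPLE = (
    '<html><body>'
    '<h3 id="productDetails"><span>Look no further!</span><b>outrunner</b></h3>'
    '</body></html>'
)


def texts_by_id(html, element_id, parser="html5lib"):
    """Return the text of every descendant tag of the element with this id."""
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        # Assumption: fall back to the standard-library parser if the
        # requested one (html5lib by default) is not installed.
        soup = BeautifulSoup(html, "html.parser")
    target = soup.find(id=element_id)
    if target is None:
        return []
    return [tag.get_text() for tag in target.find_all()]


for text in texts_by_id(SAMPLE, "productDetails"):
    print(text)
```

For the live page, urllib.request.urlopen is the Python 3 replacement for urllib2.urlopen, and its response can be passed to BeautifulSoup the same way as in the original snippet.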

Answer 1 (accepted), 2014-12-26 07:23:43
