Python and lxml.html get_element_by_id output problem

Question

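I am trying to extract data from an HTML file. My code appears to work, but not the way I expect. I can retrieve some elements but not all of them, and I wonder whether this is related to the size of the file I am reading.

I am currently trying to parse the source of this web page.

The page is about 4,500 lines long, so it is a decent size. I have been using it because I want to make sure the code works on large files.

The code I am using is: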

import lxml.html
import urllib2

# Download the page source, then parse it into an lxml element tree
webHTML = urllib2.urlopen('http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html').read()
webHTML = lxml.html.fromstring(webHTML)
# Look up the element by id and print the text of each of its direct children
productDetails = webHTML.get_element_by_id('productDetails')
for element in productDetails:
    print element.text_content()

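When I use an element_id of 'mm3', or something else near the top of the page, this gives the expected output, but when I use the element_id 'productDetails' I get no output. At least, that is what happens on my current setup.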

Answer 1

I'm afraid lxml.html cannot handle parsing this particular HTML source. It parses the h3 tag with id="productDetails" as an empty element (and this is in the default "recover" mode):

<h3 class="productDescription2" id="productDetails" itemprop="description"></h3>
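
To check this on your own copy of a page, one option is to serialize the element that lxml actually built and count its children. The sketch below is written for Python 3 (the question's code is Python 2). The helper name `inspect_element` and the sample markup are made up for illustration and are not taken from the HobbyKing page.

```python
import lxml.html


def inspect_element(html_source, element_id):
    """Return (serialized markup, child count, text) for the element with this id."""
    root = lxml.html.fromstring(html_source)
    # Raises KeyError if no element carries the requested id.
    element = root.get_element_by_id(element_id)
    markup = lxml.html.tostring(element, encoding="unicode", with_tail=False)
    return markup, len(element), element.text_content()


# Hypothetical, well-formed sample markup (an assumption, not the real page).
sample = '<div><h3 id="productDetails"><span>Motor</span> specs</h3></div>'
markup, child_count, text = inspect_element(sample, "productDetails")
print(markup)
print(child_count, repr(text))
```

If the child count is 0 and the text is empty even though the page source clearly has content inside the tag, the parser has probably closed the element early. In that case a loop over the element's children, like the one in the question, has nothing to print.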

Switch to BeautifulSoup with the html5lib parser (which is extremely lenient):

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html'
soup = BeautifulSoup(urlopen(url), 'html5lib')

for element in soup.find(id='productDetails').find_all():
    print element.text

Prints:

Looking for the ultimate power system for your next Multi-rotor project? Look no further!The Turnigy Multistar outrunners are designed with one thing in mind - maximising Multi-rotor performance! They feature high-end magnets, high quality bearings and all are precision balanced for smooth running, these motors are engineered specifically for multi-rotor use.These include a prop adapter and have a built in aluminium mount for quick and easy installation on your multi-rotor frame.

outrunner

...
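
On Python 3, where `urllib2` no longer exists, the same idea could be wrapped in a small function that works on an HTML string that has already been downloaded. This is only a sketch: it assumes the `beautifulsoup4` package is installed (plus `html5lib` when the default parser is used), and the function name `child_texts` is hypothetical.

```python
from bs4 import BeautifulSoup


def child_texts(html_source, element_id, parser="html5lib"):
    """Return the text of every descendant tag of the element with this id."""
    soup = BeautifulSoup(html_source, parser)
    container = soup.find(id=element_id)
    if container is None:
        # No element with that id was found in the parsed tree.
        return []
    return [tag.get_text() for tag in container.find_all()]
```

The page source can be fetched with `urllib.request.urlopen(url).read()` and passed in as `html_source`. Returning an empty list when the id is missing avoids the `AttributeError` that `soup.find(...)` returning `None` would otherwise cause.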
