Scraping HTML table in Python lxml

The question may sound easy, but I am having difficulty solving it. I have a table like the following:

<table><tbody>
<tr>
<td>2003</td>
<td><span class="positive">1.19</span> </td>
<td><span class="negative">-0.48</span> </td>
</tr>

My code is as follows:

from lxml import etree

for elem in tree.xpath('//*[@id="printcontent"]/div[8]/div/table/tbody/tr'):
    for c in elem.xpath("//td"):
        if c.getchildren():  # for the <span> thing
            text = c.xpath("//span/text()")
        else:
            text = c.text

But I am unable to iterate over the "td" elements. I have been trying all day, but to no avail! I want to get 2003, 1.19, and -0.48.

Kindly help!

It looks like you have HTML, not XML. Therefore, use lxml.html, not lxml.etree, to parse the data. If data.html looks like this:

<table><tbody>
<tr>
<td>2003</td>
<td><span class="positive">1.19</span> </td>
<td><span class="negative">-0.48</span> </td>
</tr>

then

import lxml.html as LH
tree = LH.parse('data.html')
print([td.text_content() for td in tree.xpath('//td')])

yields

['2003', '1.19 ', '-0.48 ']
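
The trailing spaces in the last two items come from the whitespace after each </span>. As a minimal sketch, assuming the same markup is held in a string rather than in data.html (the HTML constant and the cell_values helper are hypothetical names used only for illustration), the cell text can be stripped like this:

```python
import lxml.html as LH

# Hypothetical sample: the same markup as data.html, held in a string.
HTML = """
<table><tbody>
<tr>
<td>2003</td>
<td><span class="positive">1.19</span> </td>
<td><span class="negative">-0.48</span> </td>
</tr>
</tbody></table>
"""


def cell_values(html):
    """Return the whitespace-stripped text of every <td> in an HTML string."""
    root = LH.fromstring(html)
    return [td.text_content().strip() for td in root.xpath('//td')]


values = cell_values(HTML)
print(values)
```

If numbers are needed instead of strings, each item could then be passed through float().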

If

for elem in tree.xpath('//*[@id="printcontent"]/div[8]/div/table/tbody/tr'):

is not returning any elems, then you need to show us enough HTML to help us debug why this XPath is not working.
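
Two things in the question's loop may be worth checking while debugging. First, an XPath that starts with // is evaluated against the whole document even when called on a row, so elem.xpath("//td") returns every cell on the page; a leading dot (./td or .//td) keeps the search inside the current row. Second, browsers add <tbody> to the DOM even when the served HTML has none, so a path copied from the browser's developer tools may not match what lxml parses. The sketch below assumes a hypothetical fragment: only the printcontent id and the first row come from the question, and the second row is made-up sample data.

```python
import lxml.html as LH

# Hypothetical fragment: the id and first row come from the question;
# the surrounding structure and the second row are assumed for illustration.
HTML = """
<div id="printcontent">
<table>
<tr><td>2003</td><td><span class="positive">1.19</span> </td><td><span class="negative">-0.48</span> </td></tr>
<tr><td>2004</td><td><span class="positive">0.75</span> </td><td><span class="positive">0.10</span> </td></tr>
</table>
</div>
"""

root = LH.fromstring(HTML)

rows = []
# No tbody step in the path: this markup does not contain one.
for tr in root.xpath('//*[@id="printcontent"]//tr'):
    # The leading dot keeps the search relative to this row only.
    rows.append([td.text_content().strip() for td in tr.xpath('./td')])

print(rows)
```

Each inner list holds the cells of one row, so the year and its two values stay together.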
