python使用lxml解析html表

Question

I've an html table like this: 我有一个像这样的html表：

<TABLE>
<TR>
    <TD><P>Name</P></TD>
    <TD><P>Fees</P></TD>
    <TD><P>Awards</P></TD>
    <TD><P>Total</P></TD>
</TR>
<TR>
    <TD><P>Tony</P></TD>
    <TD >7,800</TD>
    <TD >7</TD>
    <TD>15,400</TD>
</TR>
<TR>
    <TD><P>Paul</FONT></P></TD>
    <TD >7,800</TD>
    <TD >7</TD>
    <TD>15,400</TD>
</TR>
<TR>
    <TD><P>Richard</P></TD>
    <TD >7,800</TD>
    <TD >7</TD>
    <TD>15,400</TD>
</TR>

</TR>
</TABLE>

I want to extract the values of table. 我想提取表的值。 I'd tried the following. 我试过以下。

import lxml.html
html = lxml.html.parse(''html_table)
text_value = html.xpath('//tr/td/text()')
packages = html.xpath('//tr/td/p')
p_content = [p.text_content() for p in packages]

is there any way to extract both the <p> text and the text of <td> to a single list ? 有没有办法将<p>文本和<td>的文本提取到一个列表？

Answer 1

You could do something like 你可以做点什么

>>> doc = """<TABLE>
... <TR>
...     <TD><P>Name</P></TD>
...     <TD><P>Fees</P></TD>
...     <TD><P>Awards</P></TD>
...     <TD><P>Total</P></TD>
... </TR>
... <TR>
...     <TD><P>Tony</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Paul</FONT></P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Richard</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... 
... </TR>
... </TABLE>"""
>>> import lxml.html
>>> root = lxml.html.fromstring(doc)
>>> root.xpath('//tr/td//text()')
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>>

If you have 2 tables in document, you can first loop on tables and then use a relative XPath expression (with a leading . ) for descendant text nodes on each table 如果文档中有2个表，则可以先在表上循环，然后对每个表上的后代文本节点使用相对 XPath表达式（带前导. ）

>>> doc = """<TABLE>
... <TR>
...     <TD><P>Name</P></TD>
...     <TD><P>Fees</P></TD>
...     <TD><P>Awards</P></TD>
...     <TD><P>Total</P></TD>
... </TR>
... <TR>
...     <TD><P>Tony</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Paul</FONT></P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Richard</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... 
... </TR>
... </TABLE>
... <TABLE>
... <TR>
...     <TD><P>Name</P></TD>
...     <TD><P>Fees</P></TD>
...     <TD><P>Awards</P></TD>
...     <TD><P>Total</P></TD>
... </TR>
... <TR>
...     <TD><P>Tony</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Paul</FONT></P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Richard</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... 
... </TR>
... </TABLE>"""
>>> import lxml.html
>>> root = lxml.html.fromstring(doc)
>>> root.xpath('//tr/td//text()')
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400', 'Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>> for tbl in root.xpath('//table'):
...     elements = tbl.xpath('.//tr/td//text()')
...     print elements
... 
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>>

python使用lxml解析html表

问题描述

1 个解决方案

解决方案1
7 已采纳 2013-12-06 13:48:10

python使用lxml解析html表

问题描述

1 个解决方案

解决方案1 7 已采纳 2013-12-06 13:48:10

解决方案1
7 已采纳 2013-12-06 13:48:10