Parsing HTML elements with lxml / XPath

Question

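Using lxml/Python and XPath, I retrieve the values between my tags. I would also like to get the HTML attributes, not just the text. My program works, but it skips two lines.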

Python:

#!/usr/bin/env python3

import lxml.html

# Parse the file and select the text nodes directly under each td
htmltree = lxml.html.parse('data.html')
res = htmltree.xpath("//table[@class='mainTable']/tr/td/text()")
print('\n'.join(res))

Sample data.html:

<table class='mainTable'>
         <TR>
                  <TD bgcolor="#cccccc">235</TD>
                  <TD bgcolor="#cccccc"> Windows XP / Office 2003.</TD>
                  <TD bgcolor="#cccccc">
                  G:\REMI\projets\Migration_XP_Office2003\Procedures\Installation Win XP et Office 2003.doc</TD>
                  <TD bgcolor="#cccccc">2005-10-18</TD>
                  <TD bgcolor="#cccccc">2010-12-30</TD></TR>
                  <TD bgcolor="#cccccc">
                  <P class="MsoBodyText" 
                    style="margin: 0cm 0cm 0pt;"><STRONG><FONT face="Times New Roman" size="5">blablablablablablbala<BR><BR></FONT></STRONG></FONT></P>
                  </TD>
                <TR>
                  <TD bgcolor="#cccccc">23</TD>
                  <TD bgcolor="#cccccc">XEROX/ MAC</TD>
                  <TD bgcolor="#cccccc">
                    <P>joint.</P>
                    <P>&nbsp;</P></TD>
                  <TD bgcolor="#cccccc">G:\DDTH_INF\REMI\bdcfiles\I098_Page_de_garde_MAC.doc</TD>
                  <TD bgcolor="#cccccc">2012-12-19</TD>
                  <TD bgcolor="#cccccc">2012-12-19</TD>
         </TR>
 </table>

Output:

 235 Windows XP / Office 2003.
 G:\REMI\projets\Migration_XP_Office2003\Procedures\Installation Win XP
 et Office 2003.doc 2005-10-18 2010-12-30

 23 XEROX/ MAC G:\DDTH_INF\REMI\bdcfiles\I098_Page_de_garde_MAC.doc
 2012-12-19 2012-12-19

I don't understand why the program skips

<P class="MsoBodyText" 
                        style="margin: 0cm 0cm 0pt;"><STRONG><FONT face="Times New Roman" size="5">blablablablablablbala<BR><BR></FONT></STRONG></FONT></P>

and

 <P>joint.</P>
                        <P>&nbsp;</P>

Is it because they are inside <p> tags? I just want to get all the data inside each TD. I also tried /tr/td/p/, but that is not the solution.

Note: this code is a sample and the HTML may be broken, but my actual file is well-formed.

Answer 1

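This is because you are selecting text() from each td element, which basically means "give me the text nodes located directly inside the td element". Text wrapped in a nested element such as <P> is a child of that element, not of the td, so it is not matched.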

Instead, call .text_content() on each td you find:

texts = [td.text_content() for td in htmltree.xpath("//table[@class='mainTable']/tr/td")]
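
To make the difference concrete, here is a minimal sketch comparing the two approaches. It assumes Python 3 and uses a hypothetical, trimmed-down stand-in for data.html (the SAMPLE string is not the asker's real file). It also reads the bgcolor attribute with .get(), since the question mentions wanting attributes as well as text.

```python
import lxml.html

# Hypothetical, trimmed-down stand-in for data.html (an assumption for illustration)
SAMPLE = """
<html><body>
<table class="mainTable">
  <tr>
    <td bgcolor="#cccccc">23</td>
    <td bgcolor="#cccccc"><p>joint.</p></td>
    <td bgcolor="#cccccc"><p><strong>blabla</strong></p></td>
  </tr>
</table>
</body></html>
"""

root = lxml.html.document_fromstring(SAMPLE)
cells = root.xpath("//table[@class='mainTable']//td")

# td/text() selects only text nodes that are direct children of each td,
# so text nested inside <p> or <strong> is not returned
direct_text = root.xpath("//table[@class='mainTable']//td/text()")

# text_content() concatenates the text of the element and all its descendants
full_text = [td.text_content().strip() for td in cells]

# Attributes are available on the same elements via .get()
colors = [td.get("bgcolor") for td in cells]

print(direct_text)
print(full_text)
print(colors)
```

Only the first cell shows up in direct_text, while full_text has one entry per td. If you prefer to stay in XPath, selecting td//text() (descendant text nodes) should also reach the nested text, although it returns separate text nodes rather than one string per cell.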
