How to get HTML elements with Python lxml

I have this HTML code:

<table>
 <tr>
  <td class="test"><b><a href="">aaa</a></b></td>
  <td class="test">bbb</td>
  <td class="test">ccc</td>
  <td class="test"><small>ddd</small></td>
 </tr>
 <tr>
  <td class="test"><b><a href="">eee</a></b></td>
  <td class="test">fff</td>
  <td class="test">ggg</td>
  <td class="test"><small>hhh</small></td>
 </tr>
</table>

I use this Python code to extract all <td class="test"> elements with the lxml module:

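import urllib2  # Python 2; on Python 3 use urllib.request.urlopen instead
import lxml.html

# Download the page and parse it into an lxml element tree
code   = urllib2.urlopen("http://www.example.com/page.html").read()
html   = lxml.html.fromstring(code)
# Keep the 1st and 4th <td class="test"> within each parent row
result = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')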

It works well. The result is:

<td class="test"><b><a href="">aaa</a></b></td>
<td class="test"><small>ddd</small></td>


<td class="test"><b><a href="">eee</a></b></td>
<td class="test"><small>hhh</small></td>

(that is, the first and the fourth column of each <tr>). Now I have to extract:

aaa (the text of the link)

ddd (the text inside the <small> tag)

eee (the text of the link)

hhh (the text inside the <small> tag)

How can I extract these values?

(The problem is that in the first column I have to strip the <b> tag and get the text of the anchor, and in the fourth column I have to strip the <small> tag.)

Thank you!

If you call el.text_content(), you get the text of each element with all of the markup stripped out, i.e.:

result = [el.text_content() for el in result]
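
As a minimal, self-contained sketch of the text_content() approach, the snippet below parses the table from the question as an inline string instead of downloading a page. The names SAMPLE and extract_texts are made up for this illustration and are not part of lxml.

```python
import lxml.html

# Inline copy of the table from the question (used here instead of a download)
SAMPLE = """
<table>
 <tr>
  <td class="test"><b><a href="">aaa</a></b></td>
  <td class="test">bbb</td>
  <td class="test">ccc</td>
  <td class="test"><small>ddd</small></td>
 </tr>
 <tr>
  <td class="test"><b><a href="">eee</a></b></td>
  <td class="test">fff</td>
  <td class="test">ggg</td>
  <td class="test"><small>hhh</small></td>
 </tr>
</table>
"""


def extract_texts(markup):
    """Return the text of the 1st and 4th <td class="test"> of every row."""
    doc = lxml.html.fromstring(markup)
    cells = doc.xpath('//td[@class="test"][position() = 1 or position() = 4]')
    # text_content() joins the text of the element and all of its
    # descendants, so the <b>, <a> and <small> wrappers disappear.
    return [cell.text_content().strip() for cell in cells]


print(extract_texts(SAMPLE))
# => ['aaa', 'ddd', 'eee', 'hhh']
```

The values come back in document order, so the first and fourth cell of each row alternate in the list.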

Why don't you just fetch what you want in each step?

links = [el.text for el in html.xpath('//td[@class="test"][position() = 1]/b/a')]
smalls = [el.text for el in html.xpath('//td[@class="test"][position() = 4]/small')]
print(list(zip(links, smalls)))  # this form runs on both Python 2 and Python 3
# => [('aaa', 'ddd'), ('eee', 'hhh')]
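
One possible variation on the same idea, sketched under the assumption that some rows might be missing a link or a <small> tag: query each <tr> separately with relative XPath expressions, so the values stay paired by row rather than by list position. The names SAMPLE and link_and_small_per_row are hypothetical, chosen only for this example.

```python
import lxml.html

# Inline copy of the table from the question (used here instead of a download)
SAMPLE = """
<table>
 <tr>
  <td class="test"><b><a href="">aaa</a></b></td>
  <td class="test">bbb</td>
  <td class="test">ccc</td>
  <td class="test"><small>ddd</small></td>
 </tr>
 <tr>
  <td class="test"><b><a href="">eee</a></b></td>
  <td class="test">fff</td>
  <td class="test">ggg</td>
  <td class="test"><small>hhh</small></td>
 </tr>
</table>
"""


def link_and_small_per_row(markup):
    """Return one (link text, small text) tuple per table row."""
    doc = lxml.html.fromstring(markup)
    pairs = []
    for row in doc.xpath('//tr'):
        # Relative paths are evaluated against the current row only.
        # XPath string() yields '' when nothing matches, so a missing
        # tag gives an empty string instead of shifting later values.
        link = row.xpath('string(td[@class="test"][1]//a)')
        small = row.xpath('string(td[@class="test"][4]//small)')
        pairs.append((link, small))
    return pairs


print(link_and_small_per_row(SAMPLE))
# => [('aaa', 'ddd'), ('eee', 'hhh')]
```

With two independent lists and zip, a row without a link would silently pair its <small> text with the next row's link; the per-row loop avoids that at the cost of a few more XPath calls.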
