How to get HTML elements with Python lxml
I have this HTML code:
<table>
<tr>
<td class="test"><b><a href="">aaa</a></b></td>
<td class="test">bbb</td>
<td class="test">ccc</td>
<td class="test"><small>ddd</small></td>
</tr>
<tr>
<td class="test"><b><a href="">eee</a></b></td>
<td class="test">fff</td>
<td class="test">ggg</td>
<td class="test"><small>hhh</small></td>
</tr>
</table>
I use this Python code with the lxml module to extract all <td class="test"> elements:
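import urllib2  # Python 2; in Python 3 use urllib.request instead
import lxml.html

# Download the page and parse it into an element tree.
code = urllib2.urlopen("http://www.example.com/page.html").read()
html = lxml.html.fromstring(code)
# Select the first and fourth matching <td> within each row.
result = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')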
It works well! The result is:
<td class="test"><b><a href="">aaa</a></b></td>
<td class="test"><small>ddd</small></td>
<td class="test"><b><a href="">eee</a></b></td>
<td class="test"><small>hhh</small></td>
(so the first and the fourth column of each <tr>). Now, I have to extract:
aaa (the title of the link)
ddd (the text inside the <small> tag)
eee (the title of the link)
hhh (the text inside the <small> tag)
How could I extract these values?
(The problem is that I have to remove the <b> tag and get the title of the anchor in the first column, and remove the <small> tag in the fourth column.)
Thank you!
If you call el.text_content(), you strip all the markup from each element and keep only its text, i.e.:
result = [el.text_content() for el in result]
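
To try this without fetching a page, here is a minimal sketch that parses the table from the question out of a string. It assumes Python 3 with lxml installed; the names SAMPLE, cells and texts are just for illustration and are not part of the original code.

```python
import lxml.html

# The table from the question, inlined so the example needs no network access.
SAMPLE = """<table>
<tr>
<td class="test"><b><a href="">aaa</a></b></td>
<td class="test">bbb</td>
<td class="test">ccc</td>
<td class="test"><small>ddd</small></td>
</tr>
<tr>
<td class="test"><b><a href="">eee</a></b></td>
<td class="test">fff</td>
<td class="test">ggg</td>
<td class="test"><small>hhh</small></td>
</tr>
</table>"""

root = lxml.html.fromstring(SAMPLE)

# Same XPath as in the question: first and fourth matching cell of each row.
cells = root.xpath('//td[@class="test"][position() = 1 or position() = 4]')

# text_content() returns the text of the element and all its descendants,
# so the <b>, <a> and <small> wrappers disappear.
texts = [cell.text_content() for cell in cells]
print(texts)  # ['aaa', 'ddd', 'eee', 'hhh']
```

Note that text_content() joins the text of every descendant, so a cell containing several nested elements would give you all of their text run together.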
Why don't you just fetch what you want in each step?
links = [el.text for el in html.xpath('//td[@class="test"][position() = 1]/b/a')]
smalls = [el.text for el in html.xpath('//td[@class="test"][position() = 4]/small')]
print zip(links, smalls)
# => [('aaa', 'ddd'), ('eee', 'hhh')]
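
One thing to keep in mind: zipping two independently built lists only lines up if every row has both a link and a <small> element. A possible variation, sketched here under the assumption of Python 3 and the sample table from the question, walks the rows and evaluates relative paths per row. The names SAMPLE and extract_pairs are hypothetical, and string() is the standard XPath 1.0 function, which yields an empty string when nothing matches.

```python
import lxml.html

# The table from the question, inlined so the example is self-contained.
SAMPLE = """<table>
<tr>
<td class="test"><b><a href="">aaa</a></b></td>
<td class="test">bbb</td>
<td class="test">ccc</td>
<td class="test"><small>ddd</small></td>
</tr>
<tr>
<td class="test"><b><a href="">eee</a></b></td>
<td class="test">fff</td>
<td class="test">ggg</td>
<td class="test"><small>hhh</small></td>
</tr>
</table>"""


def extract_pairs(markup):
    """Return one (link text, small text) tuple per table row."""
    root = lxml.html.fromstring(markup)
    pairs = []
    for row in root.xpath('//tr'):
        # Paths without a leading slash are evaluated relative to this row.
        link = row.xpath('string(td[@class="test"][1]/b/a)')
        small = row.xpath('string(td[@class="test"][4]/small)')
        pairs.append((link, small))
    return pairs


pairs = extract_pairs(SAMPLE)
print(pairs)  # [('aaa', 'ddd'), ('eee', 'hhh')]
```

With this layout, a row that lacks one of the two elements still produces a tuple, with an empty string in the missing position, instead of shifting the later values.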