Extract data from an HTML table using Python
I am trying to parse data from a website. For example, part of the source code of the site I want to extract data from looks like this.
<table summary="Customer Pending and Vendor Pending Table">
<tr>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Level&Escalationorder=0#Escalation" class="headlink">
<img src="/images/rat/up_selected.png" width="11" height="9" border="0" alt="up">Risk </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgLastUpd&Escalationorder=1#Escalation" class="headlink">
Avg Last Updated </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgDaysOpen&Escalationorder=1#Escalation" class="headlink">
Avg Days Open </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Srs&Escalationorder=1#Escalation" class="headlink">
# of Cases </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort_pct=1&Escalationorder=1#Escalation" class="headlink">% of Total Cases</a> </th>
</tr>
<tr >
<td><a href="/snapshot.php?statusrisk=2&wrkgrp=Somedata&function=statusrisk&statuses=CustomerPending"><img src="/images/rat/severity_2.gif" alt="Very High Risk" title="Very High Risk" border="0"></a></td>
<td> 8.0</td>
<td> 69.0</td>
<td>1</td>
<td> 3.1</td>
</tr>
I need to extract the values 8.0, 69.0, and 3.1 from the table above. My Python code looks like this.
from lxml import html
import requests
page = requests.get('http://rat-sucker.abc.com/team.php?wrkgrp=somedata')
tree = html.fromstring(page.text)
Stats = tree.xpath('//*[@id="leftrat"]/table[1]/tbody/tr[2]/td[2]')
print 'Stats: ', Stats
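I have checked my XPath using several methods and an Xcode emulator, and it is correct (it may not work if run against the partial code above), but when my Python script runs, it produces no output.
[root@testbed testhost]# python scrapper.py
Stats:
[root@testbed testhost]#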
You can use the BeautifulSoup parser.
>>> s = '''<table summary="Customer Pending and Vendor Pending Table">
<tr>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Level&Escalationorder=0#Escalation" class="headlink">
<img src="/images/rat/up_selected.png" width="11" height="9" border="0" alt="up">Risk </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgLastUpd&Escalationorder=1#Escalation" class="headlink">
Avg Last Updated </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgDaysOpen&Escalationorder=1#Escalation" class="headlink">
Avg Days Open </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Srs&Escalationorder=1#Escalation" class="headlink">
# of Cases </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort_pct=1&Escalationorder=1#Escalation" class="headlink">% of Total Cases</a> </th>
</tr>
<tr >
<td><a href="/snapshot.php?statusrisk=2&wrkgrp=Somedata&function=statusrisk&statuses=CustomerPending"><img src="/images/rat/severity_2.gif" alt="Very High Risk" title="Very High Risk" border="0"></a></td>
<td> 8.0</td>
<td> 69.0</td>
<td>1</td>
<td> 3.1</td>
</tr>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly
>>> [i.text.strip() for i in soup.find_all('td', text=True)]
['8.0', '69.0', '1', '3.1']
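If you would rather stay with lxml, as in the question, here is a minimal sketch. It assumes the page is served without a tbody element: browsers add tbody to the DOM, so an XPath copied from browser tools often matches nothing in the raw source, and that may be why the original expression came back empty. The SAMPLE string is a trimmed stand-in for the real page, and extract_stats is a hypothetical helper name. The table is selected by its summary attribute because the element with id "leftrat" does not appear in the posted snippet.

```python
from lxml import html

# Trimmed stand-in for the page in the question (header row shortened).
SAMPLE = """
<table summary="Customer Pending and Vendor Pending Table">
<tr><th>Risk</th><th>Avg Last Updated</th><th>Avg Days Open</th>
<th># of Cases</th><th>% of Total Cases</th></tr>
<tr >
<td><a href="/snapshot.php?statusrisk=2"><img src="/images/rat/severity_2.gif" alt="Very High Risk"></a></td>
<td> 8.0</td>
<td> 69.0</td>
<td>1</td>
<td> 3.1</td>
</tr>
</table>
"""


def extract_stats(page_source):
    """Return the Avg Last Updated, Avg Days Open and % of Total Cases cells."""
    tree = html.fromstring(page_source)
    # No tbody step here: lxml parses the markup as served and does not
    # insert the tbody element that browsers add to the DOM.
    cells = tree.xpath(
        '//table[@summary="Customer Pending and Vendor Pending Table"]/tr[2]/td'
    )
    texts = [cell.text_content().strip() for cell in cells]
    # Columns 2, 3 and 5 of the data row (zero-based indexes 1, 2 and 4).
    return [texts[i] for i in (1, 2, 4)]


if __name__ == "__main__":
    print("Stats:", extract_stats(SAMPLE))
```

Against the live site you would pass page.text from requests.get instead of SAMPLE. If the served markup does contain tbody, the path needs that step put back.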