I am trying to parse data from a website. For example, a portion of the page source for the site I am trying to extract data from looks like this:
<table summary="Customer Pending and Vendor Pending Table">
<tr>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Level&Escalationorder=0#Escalation" class="headlink">
<img src="/images/rat/up_selected.png" width="11" height="9" border="0" alt="up">Risk </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgLastUpd&Escalationorder=1#Escalation" class="headlink">
Avg Last Updated </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgDaysOpen&Escalationorder=1#Escalation" class="headlink">
Avg Days Open </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Srs&Escalationorder=1#Escalation" class="headlink">
# of Cases </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort_pct=1&Escalationorder=1#Escalation" class="headlink">% of Total Cases</a> </th>
</tr>
<tr >
<td><a href="/snapshot.php?statusrisk=2&wrkgrp=Somedata&function=statusrisk&statuses=CustomerPending"><img src="/images/rat/severity_2.gif" alt="Very High Risk" title="Very High Risk" border="0"></a></td>
<td> 8.0</td>
<td> 69.0</td>
<td>1</td>
<td> 3.1</td>
</tr>
I need to extract the values 8.0, 69.0 and 3.1 from the above table. My Python code looks like this:
from lxml import html
import requests
page = requests.get('http://rat-sucker.abc.com/team.php?wrkgrp=somedata')
tree = html.fromstring(page.text)
Stats = tree.xpath('//*[@id="leftrat"]/table[1]/tbody/tr[2]/td[2]')
print 'Stats: ', Stats
I have checked my XPath using several methods and the Xcode simulator, and it is correct (it may not work if you run it against the partial source above), but when I run my Python script it does not produce any output.
[root@testbed testhost]# python scrapper.py Stats
[root@testbed testhost]#
You could use the BeautifulSoup parser.
>>> s = '''<table summary="Customer Pending and Vendor Pending Table">
<tr>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Level&Escalationorder=0#Escalation" class="headlink">
<img src="/images/rat/up_selected.png" width="11" height="9" border="0" alt="up">Risk </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgLastUpd&Escalationorder=1#Escalation" class="headlink">
Avg Last Updated </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgDaysOpen&Escalationorder=1#Escalation" class="headlink">
Avg Days Open </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Srs&Escalationorder=1#Escalation" class="headlink">
# of Cases </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort_pct=1&Escalationorder=1#Escalation" class="headlink">% of Total Cases</a> </th>
</tr>
<tr >
<td><a href="/snapshot.php?statusrisk=2&wrkgrp=Somedata&function=statusrisk&statuses=CustomerPending"><img src="/images/rat/severity_2.gif" alt="Very High Risk" title="Very High Risk" border="0"></a></td>
<td> 8.0</td>
<td> 69.0</td>
<td>1</td>
<td> 3.1</td>
</tr>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, 'html.parser')
>>> [i.text.strip() for i in soup.find_all('td', text=True)]
['8.0', '69.0', '1', '3.1']
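If you would rather stay with lxml, one thing worth checking is the tbody step in the XPath. Browser developer tools usually show a tbody element inside every table even when the raw HTML has none, and lxml does not add one, so an XPath copied from the browser can match nothing. The following is only a minimal sketch under that assumption: the snippet is a trimmed copy of the table from the question, and selecting the table by its summary attribute is my own choice, since the real page (including the leftrat container) is not shown.

```python
from lxml import html

# Trimmed copy of the table from the question; assumed to be representative
# of the raw HTML the server actually sends (no <tbody> in the source).
snippet = '''<table summary="Customer Pending and Vendor Pending Table">
<tr><th>Risk</th><th>Avg Last Updated</th><th>Avg Days Open</th><th># of Cases</th><th>% of Total Cases</th></tr>
<tr >
<td><a href="/snapshot.php?statusrisk=2"><img src="/images/rat/severity_2.gif" alt="Very High Risk"></a></td>
<td> 8.0</td>
<td> 69.0</td>
<td>1</td>
<td> 3.1</td>
</tr>
</table>'''

tree = html.fromstring(snippet)

# No tbody step here: in the raw source the <tr> sits directly under <table>.
cells = tree.xpath(
    '//table[@summary="Customer Pending and Vendor Pending Table"]/tr[2]/td/text()'
)

# Drop surrounding whitespace and any empty text nodes.
values = [c.strip() for c in cells if c.strip()]
print(values)
```

Against the real page you would pass page.text to html.fromstring instead of the snippet. Appending /text() also matters: without it, xpath() returns element objects rather than the cell contents.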