Extracting data from an HTML table using Python

Question

I am trying to parse data from a website. For example, a portion of the source code of the site I am trying to extract data from looks like this:

<table summary="Customer Pending and Vendor Pending Table">
  <tr>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Level&Escalationorder=0#Escalation" class="headlink">
  <img src="/images/rat/up_selected.png" width="11" height="9" border="0" alt="up">Risk          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgLastUpd&Escalationorder=1#Escalation" class="headlink">
  Avg Last Updated          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgDaysOpen&Escalationorder=1#Escalation" class="headlink">
  Avg Days Open          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Srs&Escalationorder=1#Escalation" class="headlink">
  # of Cases          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort_pct=1&Escalationorder=1#Escalation" class="headlink">% of Total Cases</a> </th>
  </tr>
        <tr >
  <td><a href="/snapshot.php?statusrisk=2&wrkgrp=Somedata&function=statusrisk&statuses=CustomerPending"><img src="/images/rat/severity_2.gif" alt="Very High Risk" title="Very High Risk" border="0"></a></td>
    <td>   8.0</td>
    <td>  69.0</td>
    <td>1</td>
    <td>   3.1</td>
  </tr>

I need to extract the values 8.0, 69.0 and 3.1 from the above table. My Python code looks like this:

from lxml import html
import requests

page = requests.get('http://rat-sucker.abc.com/team.php?wrkgrp=somedata')
tree = html.fromstring(page.text)
Stats = tree.xpath('//*[@id="leftrat"]/table[1]/tbody/tr[2]/td[2]')

print 'Stats: ', Stats

I have checked my XPath using several methods and the Xcode simulator, and it is correct (it may not work if you run it against the partial code above), but when I run my Python script it does not produce any output.

[root@testbed testhost]# python scrapper.py
Stats
[root@testbed testhost]#

Answer 1

You could use the BeautifulSoup parser.

>>> s = '''<table summary="Customer Pending and Vendor Pending Table">
  <tr>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Level&Escalationorder=0#Escalation" class="headlink">
  <img src="/images/rat/up_selected.png" width="11" height="9" border="0" alt="up">Risk          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgLastUpd&Escalationorder=1#Escalation" class="headlink">
  Avg Last Updated          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgDaysOpen&Escalationorder=1#Escalation" class="headlink">
  Avg Days Open          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Srs&Escalationorder=1#Escalation" class="headlink">
  # of Cases          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort_pct=1&Escalationorder=1#Escalation" class="headlink">% of Total Cases</a> </th>
  </tr>
        <tr >
  <td><a href="/snapshot.php?statusrisk=2&wrkgrp=Somedata&function=statusrisk&statuses=CustomerPending"><img src="/images/rat/severity_2.gif" alt="Very High Risk" title="Very High Risk" border="0"></a></td>
    <td>   8.0</td>
    <td>  69.0</td>
    <td>1</td>
    <td>   3.1</td>
  </tr>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, 'html.parser')
>>> [i.text.strip() for i in soup.find_all('td', text=True)]
['8.0', '69.0', '1', '3.1']
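
One possible reason the XPath in the question returns nothing, offered as a guess since the full page is not shown: the excerpt contains no <tbody> element. Browsers add one when they build the DOM, so an XPath copied from a browser tool can include a /tbody/ step that does not exist in the raw HTML that requests downloads. The sketch below uses only the standard library's html.parser to check which tags appear in the raw markup and to collect the cell text. The names SAMPLE and CellCollector are hypothetical, and SAMPLE is a shortened copy of the table from the question.

```python
from html.parser import HTMLParser

# Shortened copy of the table from the question (header row omitted).
SAMPLE = '''<table summary="Customer Pending and Vendor Pending Table">
  <tr >
  <td><a href="/snapshot.php?statusrisk=2&wrkgrp=Somedata"><img src="/images/rat/severity_2.gif" alt="Very High Risk" border="0"></a></td>
    <td>   8.0</td>
    <td>  69.0</td>
    <td>1</td>
    <td>   3.1</td>
  </tr>
</table>'''


class CellCollector(HTMLParser):
    """Hypothetical helper: records tag names and the text of each <td>."""

    def __init__(self):
        super().__init__()
        self.tags = set()    # every start tag seen in the raw markup
        self.cells = []      # stripped text of each <td>, in document order
        self._in_td = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)
        if tag == 'td':
            self._in_td = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == 'td' and self._in_td:
            self.cells.append(''.join(self._buf).strip())
            self._in_td = False

    def handle_data(self, data):
        # Only keep text that sits inside a <td>.
        if self._in_td:
            self._buf.append(data)


collector = CellCollector()
collector.feed(SAMPLE)
collector.close()

has_tbody = 'tbody' in collector.tags       # is there a <tbody> in the raw HTML?
values = [c for c in collector.cells if c]  # drop the image-only first cell
print(has_tbody)
print(values)
```

If has_tbody is False for the real page as well, removing the /tbody step from the XPath (and selecting text() or calling text_content() on the result) would be the first thing to try with lxml.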
