Extracting data from an HTML table using Python

Question

I am trying to parse data from a website. Part of the source of the page I want to extract data from looks like this:

<table summary="Customer Pending and Vendor Pending Table">
  <tr>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Level&Escalationorder=0#Escalation" class="headlink">
  <img src="/images/rat/up_selected.png" width="11" height="9" border="0" alt="up">Risk          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgLastUpd&Escalationorder=1#Escalation" class="headlink">
  Avg Last Updated          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgDaysOpen&Escalationorder=1#Escalation" class="headlink">
  Avg Days Open          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Srs&Escalationorder=1#Escalation" class="headlink">
  # of Cases          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort_pct=1&Escalationorder=1#Escalation" class="headlink">% of Total Cases</a> </th>
  </tr>
        <tr >
  <td><a href="/snapshot.php?statusrisk=2&wrkgrp=Somedata&function=statusrisk&statuses=CustomerPending"><img src="/images/rat/severity_2.gif" alt="Very High Risk" title="Very High Risk" border="0"></a></td>
    <td>   8.0</td>
    <td>  69.0</td>
    <td>1</td>
    <td>   3.1</td>
  </tr>

I need to extract the values 8.0, 69.0 and 3.1 from the table above. My Python code looks like this:

from lxml import html
import requests

page = requests.get('http://rat-sucker.abc.com/team.php?wrkgrp=somedata')
tree = html.fromstring(page.text)
# The XPath expression has to be passed to xpath() as a string.
Stats = tree.xpath('//*[@id="leftrat"]/table[1]/tbody/tr[2]/td[2]')

print('Stats: ', Stats)

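I have checked my XPath using several methods and an Xcode simulator, and it is correct (it may not work if run against the partial markup above), but when my Python script runs it produces no output.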

[root@testbed testhost]# python scrapper.py
Stats:
[root@testbed testhost]#

Answer 1

You can use the BeautifulSoup parser.

>>> from bs4 import BeautifulSoup
>>> s = '''<table summary="Customer Pending and Vendor Pending Table">
  <tr>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Level&Escalationorder=0#Escalation" class="headlink">
  <img src="/images/rat/up_selected.png" width="11" height="9" border="0" alt="up">Risk          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgLastUpd&Escalationorder=1#Escalation" class="headlink">
  Avg Last Updated          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgDaysOpen&Escalationorder=1#Escalation" class="headlink">
  Avg Days Open          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Srs&Escalationorder=1#Escalation" class="headlink">
  # of Cases          </a> </th>
        <th> <a href="/team.php?wrkgrp=Somedata&Escalationsort_pct=1&Escalationorder=1#Escalation" class="headlink">% of Total Cases</a> </th>
  </tr>
        <tr >
  <td><a href="/snapshot.php?statusrisk=2&wrkgrp=Somedata&function=statusrisk&statuses=CustomerPending"><img src="/images/rat/severity_2.gif" alt="Very High Risk" title="Very High Risk" border="0"></a></td>
    <td>   8.0</td>
    <td>  69.0</td>
    <td>1</td>
    <td>   3.1</td>
  </tr>'''
>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly
>>> [i.text.strip() for i in soup.find_all('td', string=True)]  # only cells holding plain text
['8.0', '69.0', '1', '3.1']
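
If you would rather stay with lxml, one likely reason the XPath in the question matches nothing is the `tbody` step. Browser developer tools usually show a `<tbody>` that the browser inserted itself, while the markup posted in the question has none, and lxml does not add one. The sketch below runs on a trimmed copy of that table (header links shortened, closing `</table>` added) and selects the table by its `summary` attribute, because the `id="leftrat"` container is not part of the posted snippet. The live page may differ, so treat this as an illustration rather than a confirmed diagnosis.

```python
from lxml import html

# Trimmed copy of the table from the question (header links shortened).
SAMPLE = """
<table summary="Customer Pending and Vendor Pending Table">
  <tr>
    <th>Risk</th><th>Avg Last Updated</th><th>Avg Days Open</th>
    <th># of Cases</th><th>% of Total Cases</th>
  </tr>
  <tr>
    <td><a href="#"><img src="/images/rat/severity_2.gif" alt="Very High Risk"></a></td>
    <td>   8.0</td>
    <td>  69.0</td>
    <td>1</td>
    <td>   3.1</td>
  </tr>
</table>
"""

tree = html.fromstring(SAMPLE)

# Path copied from a browser, including tbody: no match in the raw markup.
with_tbody = tree.xpath('//table[1]/tbody/tr[2]/td[2]')

# Same path without tbody: matches the second cell of the data row.
without_tbody = tree.xpath('//table[1]/tr[2]/td[2]')

# Select the table by its summary attribute and read cells 2, 3 and 5.
cells = tree.xpath(
    '//table[@summary="Customer Pending and Vendor Pending Table"]/tr[2]/td'
)
values = [float(cells[i].text_content().strip()) for i in (1, 2, 4)]

print(len(with_tbody), len(without_tbody), values)
```

On the real page, printing `page.text` and searching it for `tbody` and `leftrat` is a quick way to check whether the path copied from the browser can match at all.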

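The list comprehension in the answer also returns the `# of Cases` cell (`'1'`), which the question did not ask for. A minimal sketch, assuming the first row always holds the headers and the second row the data, pairs each header with its cell so the three wanted columns can be picked by name. The `SAMPLE` string and the `wanted` list are illustrative names, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (header links shortened).
SAMPLE = """
<table summary="Customer Pending and Vendor Pending Table">
  <tr>
    <th>Risk</th><th>Avg Last Updated</th><th>Avg Days Open</th>
    <th># of Cases</th><th>% of Total Cases</th>
  </tr>
  <tr>
    <td><a href="#"><img src="/images/rat/severity_2.gif" alt="Very High Risk"></a></td>
    <td>   8.0</td>
    <td>  69.0</td>
    <td>1</td>
    <td>   3.1</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
rows = soup.find_all("tr")

# Header text from the first row, cell text from the second row.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
cells = [td.get_text(strip=True) for td in rows[1].find_all("td")]

# Map each column name to its value, then keep only the wanted columns.
row = dict(zip(headers, cells))
wanted = ["Avg Last Updated", "Avg Days Open", "% of Total Cases"]
values = [float(row[name]) for name in wanted]

print(values)
```

Looking cells up by header name rather than by position keeps the script working if the site reorders its columns, at the cost of breaking if a header is renamed.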
Solution 1
4 2015-02-10 14:36:36