Web Scraping tables from an HTML file

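Hello, I am looking for help with taking the tables in an HTML file and importing them into a CSV file. I am very new to web scraping, so please forgive me if my code is completely wrong. The HTML file contains three separate tables that I want to extract: the estimate, the sampling error, and the number of non-zero plots in the estimate.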

My code is shown below:

#import necessary libraries
import urllib2
import pandas as pd

#specify URL
table = "file:///C:/Users/TMccw/Anaconda2/FiaAPI/outFArea18.html"

#Query the website & return the html to the variable 'page'
page = urllib2.urlopen(table)

#import the bs4 functions to parse the data returned from the website
from bs4 import BeautifulSoup

#Parse the html in the 'page' variable & store it in bs4 format
soup = BeautifulSoup(page, 'html.parser')

#Print out the html code with the function prettify
print soup.prettify()

#Find the tables & check type
table2 = soup.find_all('table')
print(table2)
print type(table2)

#Create new table as a dataframe
new_table = pd.DataFrame(columns=range(0,4))

#Extract the info from the HTML code 
soup.find('table').find_all('td'),{'align':'right'}

#Remove the tags and extract table info into CSV
???

This is the HTML for the first table, "Estimate":

 ` Estimate:
     </b>
     </caption>
     <tr>
     <td>
     </td>
    <td align="center" colspan="5">
     <b>
      Ownership group
     </b>
    </td>
   </tr>
   <tr>
    <th>
     <b>
      Forest type group
     </b>
    </th>
    <td>
     <b>
      Total
     </b>
    </td>
    <td>
     <b>
      National Forest
     </b>
    </td>
    <td>
     <b>
      Other federal
     </b>
    </td>
    <td>
     <b>
      State and local
     </b>
    </td>
    <td>
     <b>
      Private
     </b>
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Total
     </b>
    </td>
    <td align="right">
     4,875,993
    </td>
    <td align="right">
     195,438
    </td>
    <td align="right">
     169,500
    </td>
    <td align="right">
     392,030
    </td>
    <td align="right">
     4,119,025
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      White / red / jack pine group
     </b>
    </td>
    <td align="right">
     40,492
    </td>
    <td align="right">
     3,426
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     10,850
    </td>
    <td align="right">
     26,217
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Loblolly / shortleaf pine group
     </b>
    </td>
    <td align="right">
     38,267
    </td>
    <td align="right">
     11,262
    </td>
    <td align="right">
     997
    </td>
    <td align="right">
     4,015
    </td>
    <td align="right">
     21,993
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Other eastern softwoods group
     </b>
    </td>
    <td align="right">
     25,181
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     25,181
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Exotic softwoods group
     </b>
    </td>
    <td align="right">
     5,868
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     662
    </td>
    <td align="right">
     5,206
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Oak / pine group
     </b>
    </td>
    <td align="right">
     144,238
    </td>
    <td align="right">
     9,592
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     21,475
    </td>
    <td align="right">
     113,171
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Oak / hickory group
     </b>
    </td>
    <td align="right">
     3,480,272
    </td>
    <td align="right">
     152,598
    </td>
    <td align="right">
     123,900
    </td>
    <td align="right">
     285,305
    </td>
    <td align="right">
     2,918,470
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Oak / gum / cypress group
     </b>
    </td>
    <td align="right">
     76,302
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     12,209
    </td>
    <td align="right">
     9,311
    </td>
    <td align="right">
     54,782
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Elm / ash / cottonwood group
     </b>
    </td>
    <td align="right">
     652,001
    </td>
    <td align="right">
     7,105
    </td>
    <td align="right">
     25,431
    </td>
    <td align="right">
     46,096
    </td>
    <td align="right">
     573,369
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Maple / beech / birch group
     </b>
    </td>
    <td align="right">
     346,718
    </td>
    <td align="right">
     10,871
    </td>
    <td align="right">
     818
    </td>
    <td align="right">
     12,748
    </td>
    <td align="right">
     322,281
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Other hardwoods group
     </b>
    </td>
    <td align="right">
     21,238
    </td>
    <td align="right">
     585
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     20,653
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Exotic hardwoods group
     </b>
    </td>
    <td align="right">
     2,441
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     2,441
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Nonstocked
     </b>
    </td>
    <td align="right">
     42,975
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     6,144
    </td>
    <td align="right">
     1,570
    </td>
    <td align="right">
     35,261
    </td>
   </tr>
  </table>
  <br/>
  <table border="4" cellpadding="4" cellspacing="4">
   <caption>
    <b>`

I'm not sure exactly what the problem is here, but there is one error I can spot right away that could be throwing you off a bit.

new_table = pd.DataFrame(columns=range(0-4))

needs to be

new_table = pd.DataFrame(columns=range(0,4))

range(0-4) is actually range(-4), which is equivalent to range(0, -4), an empty range, whereas what you want is range(0, 4). You can pass either range(4) or range(0, 4) as the argument.
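As a quick check of the point above, here is a small sketch written for Python 3 (the question's code is Python 2, where `range` returns a list directly, but the arithmetic is the same). The variable names are only for illustration.

```python
# 0-4 is evaluated before range() is called, so this is range(-4),
# which contains no values at all.
wrong = list(range(0 - 4))

# Both of these produce the four column labels 0, 1, 2, 3.
right = list(range(0, 4))
short = list(range(4))

print(wrong)           # []
print(right)           # [0, 1, 2, 3]
print(short == right)  # True
```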

I made four tables almost identical to yours and put them into a fairly respectable HTML page. Then I ran this code.

>>> import bs4
>>> import pandas as pd
>>> soup = bs4.BeautifulSoup(open('temp.htm').read(), 'html.parser')
>>> tables = soup.findAll('table')
>>> for t, table in enumerate(tables):
...     df = pd.read_html(str(table), skiprows=2)
...     df[0].to_csv('table%s.csv' % t)

The result is four files like this one, named table0.csv through table3.csv.

,0,1,2,3,4,5
0,Total,4875993,195438,169500,392030,4119025
1,White / red / jack pine group,40492,3426,-,10850,26217
2,Loblolly / shortleaf pine group,38267,11262,997,4015,21993
3,Other eastern softwoods group,25181,-,-,-,25181
4,Exotic softwoods group,5868,-,-,662,5206
5,Oak / pine group,144238,9592,-,21475,113171
6,Oak / hickory group,3480272,152598,123900,285305,2918470
7,Oak / gum / cypress group,76302,-,12209,9311,54782
8,Elm / ash / cottonwood group,652001,7105,25431,46096,573369
9,Maple / beech / birch group,346718,10871,818,12748,322281
10,Other hardwoods group,21238,585,-,-,20653
11,Exotic hardwoods group,2441,-,-,-,2441
12,Nonstocked,42975,-,6144,1570,35261

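Perhaps the main thing I should mention is that I skip the same number of rows in each table delivered by BeautifulSoup. If the number of header rows differs from table to table, you will have to do something more clever, or simply omit the skiprows parameter and discard the unwanted lines from the output files.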
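One possible version of "something more clever" is sketched below. It is not part of the original answer. It assumes each table's rows have already been pulled out as lists of cell strings, for example with BeautifulSoup via `[[c.get_text(strip=True) for c in tr.find_all(['td', 'th'])] for tr in table.find_all('tr')]`. It also assumes that a data row is a non-empty label followed only by numbers or the `-` placeholder, as in the table shown in the question. The helper names `is_number_or_dash` and `data_rows` are hypothetical.

```python
import csv
import io

def is_number_or_dash(text):
    """True for cells such as '4,875,993', '997', or the '-' placeholder."""
    text = text.strip()
    return text == "-" or text.replace(",", "").isdigit()

def data_rows(rows):
    """Keep rows that look like data; drop header rows of any count."""
    kept = []
    for row in rows:
        label, values = row[0].strip(), row[1:]
        if label and values and all(is_number_or_dash(v) for v in values):
            # Strip thousands separators so the CSV holds plain digits.
            kept.append([label] + [v.strip().replace(",", "") for v in values])
    return kept

# Sample rows copied from the "Estimate" table in the question.
rows = [
    ["", "Ownership group"],
    ["Forest type group", "Total", "National Forest", "Other federal",
     "State and local", "Private"],
    ["Total", "4,875,993", "195,438", "169,500", "392,030", "4,119,025"],
    ["White / red / jack pine group", "40,492", "3,426", "-", "10,850", "26,217"],
]

kept = data_rows(rows)

# Write to an in-memory buffer here; a real file opened with newline=""
# would work the same way.
buffer = io.StringIO()
csv.writer(buffer).writerows(kept)
print(buffer.getvalue())
```

Because header rows are recognised by their content rather than by position, the same function should cope with tables that have one, two, or more header rows, as long as the assumption about data rows holds.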
