[英]Web Scraping tables from an HTML file
您好,我希望在將表格放入HTML文件並將其導入到csv文件中時獲得幫助。 我對網頁抓取非常陌生,因此如果我對自己的代碼完全錯了,請給我。 HTML文件包含我要提取的三個單獨的表。 估計,抽樣誤差和估計中非零圖的數量。
我的代碼如下所示:
#import necessary libraries
import urllib2
import pandas as pd
#specify URL
table = "file:///C:/Users/TMccw/Anaconda2/FiaAPI/outFArea18.html"
#Query the website & return the html to the variable 'page'
page = urllib2.urlopen(table)
#import the bs4 functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable & store it in bs4 format
soup = BeautifulSoup(page, 'html.parser')
#Print out the html code with the function prettify
print soup.prettify()
#Find the tables & check type
table2 = soup.find_all('table')
print(table2)
print type(table2)
#Create new table as a dataframe
new_table = pd.DataFrame(columns=range(0,4))
#Extract the info from the HTML code
soup.find('table').find_all('td'),{'align':'right'}
#Remove the tags and extract table info into CSV
???
這是第一個表“ Estimate”的HTML:
` Estimate:
</b>
</caption>
<tr>
<td>
</td>
<td align="center" colspan="5">
<b>
Ownership group
</b>
</td>
</tr>
<tr>
<th>
<b>
Forest type group
</b>
</th>
<td>
<b>
Total
</b>
</td>
<td>
<b>
National Forest
</b>
</td>
<td>
<b>
Other federal
</b>
</td>
<td>
<b>
State and local
</b>
</td>
<td>
<b>
Private
</b>
</td>
</tr>
<tr>
<td nowrap="">
<b>
Total
</b>
</td>
<td align="right">
4,875,993
</td>
<td align="right">
195,438
</td>
<td align="right">
169,500
</td>
<td align="right">
392,030
</td>
<td align="right">
4,119,025
</td>
</tr>
<tr>
<td nowrap="">
<b>
White / red / jack pine group
</b>
</td>
<td align="right">
40,492
</td>
<td align="right">
3,426
</td>
<td align="right">
-
</td>
<td align="right">
10,850
</td>
<td align="right">
26,217
</td>
</tr>
<tr>
<td nowrap="">
<b>
Loblolly / shortleaf pine group
</b>
</td>
<td align="right">
38,267
</td>
<td align="right">
11,262
</td>
<td align="right">
997
</td>
<td align="right">
4,015
</td>
<td align="right">
21,993
</td>
</tr>
<tr>
<td nowrap="">
<b>
Other eastern softwoods group
</b>
</td>
<td align="right">
25,181
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
25,181
</td>
</tr>
<tr>
<td nowrap="">
<b>
Exotic softwoods group
</b>
</td>
<td align="right">
5,868
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
662
</td>
<td align="right">
5,206
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / pine group
</b>
</td>
<td align="right">
144,238
</td>
<td align="right">
9,592
</td>
<td align="right">
-
</td>
<td align="right">
21,475
</td>
<td align="right">
113,171
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / hickory group
</b>
</td>
<td align="right">
3,480,272
</td>
<td align="right">
152,598
</td>
<td align="right">
123,900
</td>
<td align="right">
285,305
</td>
<td align="right">
2,918,470
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / gum / cypress group
</b>
</td>
<td align="right">
76,302
</td>
<td align="right">
-
</td>
<td align="right">
12,209
</td>
<td align="right">
9,311
</td>
<td align="right">
54,782
</td>
</tr>
<tr>
<td nowrap="">
<b>
Elm / ash / cottonwood group
</b>
</td>
<td align="right">
652,001
</td>
<td align="right">
7,105
</td>
<td align="right">
25,431
</td>
<td align="right">
46,096
</td>
<td align="right">
573,369
</td>
</tr>
<tr>
<td nowrap="">
<b>
Maple / beech / birch group
</b>
</td>
<td align="right">
346,718
</td>
<td align="right">
10,871
</td>
<td align="right">
818
</td>
<td align="right">
12,748
</td>
<td align="right">
322,281
</td>
</tr>
<tr>
<td nowrap="">
<b>
Other hardwoods group
</b>
</td>
<td align="right">
21,238
</td>
<td align="right">
585
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
20,653
</td>
</tr>
<tr>
<td nowrap="">
<b>
Exotic hardwoods group
</b>
</td>
<td align="right">
2,441
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
2,441
</td>
</tr>
<tr>
<td nowrap="">
<b>
Nonstocked
</b>
</td>
<td align="right">
42,975
</td>
<td align="right">
-
</td>
<td align="right">
6,144
</td>
<td align="right">
1,570
</td>
<td align="right">
35,261
</td>
</tr>
</table>
<br/>
<table border="4" cellpadding="4" cellspacing="4">
<caption>
<b>`
不確定這里到底是什么問題,但是馬上就能發現一個錯誤,這會讓您有點不滿意。
new_table = pd.DataFrame(columns=range(0-4))
需要是
new_table = pd.DataFrame(columns=range(0,4))
range(0-4)的結果實際上是range(-4),其計算結果為range(0,-4),而您想要的是range(0,4)。 您可以只將range(4)作為參數或range(0,4)傳遞。
我制作了四個與您幾乎相同的表,並將它們放入相當受人尊敬的HTML頁面中。 然后我運行了這段代碼。
>>> import bs4
>>> import pandas as pd
>>> soup = bs4.BeautifulSoup(open('temp.htm').read(), 'html.parser')
>>> tables = soup.findAll('table')
>>> for t, table in enumerate(tables):
... df = pd.read_html(str(table), skiprows=2)
... df[0].to_csv('table%s.csv' % t)
結果是四個這樣的文件,分別名為table0.csv至table3.csv。
,0,1,2,3,4,5
0,Total,4875993,195438,169500,392030,4119025
1,White / red / jack pine group,40492,3426,-,10850,26217
2,Loblolly / shortleaf pine group,38267,11262,997,4015,21993
3,Other eastern softwoods group,25181,-,-,-,25181
4,Exotic softwoods group,5868,-,-,662,5206
5,Oak / pine group,144238,9592,-,21475,113171
6,Oak / hickory group,3480272,152598,123900,285305,2918470
7,Oak / gum / cypress group,76302,-,12209,9311,54782
8,Elm / ash / cottonwood group,652001,7105,25431,46096,573369
9,Maple / beech / birch group,346718,10871,818,12748,322281
10,Other hardwoods group,21238,585,-,-,20653
11,Exotic hardwoods group,2441,-,-,-,2441
12,Nonstocked,42975,-,6144,1570,35261
也許我應該提到的主要事情是,我跳過了BeautifulSoup提供的每個表中的相同行數。 如果表中標題行的數量不同,那么您將不得不做一些更聰明的事情,或者只是丟棄輸出文件中的行,並忽略skiprows
參數。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.