Table data scraping using BeautifulSoup in Python
Why am I not getting all the rows when I extract table data with BeautifulSoup in Python?
Link to the website - http://www.fao.org/3/x0490e/x0490e04.htm
table1_rows = table1.find_all('tr')
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
print(row)
row = [item.strip() for item in row if str(item)]
row
After making some changes:
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
row = [item.strip() for item in row if str(item)]
print(row)
Even then I don't get the output. Can anyone help me? Why do I get no output when I print the row variable from the loop?
This line:
row = [item.strip() for item in row if str(item)]
should be indented so that it sits inside the for tr in table1_rows loop:
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    print(row)
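The effect of the indentation can be reproduced without any scraping. The sketch below is illustrative only: `raw_rows` is made-up data standing in for the text of each row's cells, and the two function names are hypothetical.

```python
# Made-up cell text standing in for [i.text for i in td] on each <tr>.
raw_rows = [
    [" a ", " b "],
    [" c ", " d "],
    [" e ", " f "],
]


def strip_outside_loop(rows):
    """Cleaning step dedented out of the loop: only the last row survives."""
    row = []
    for cells in rows:
        row = list(cells)
    # Runs once, after the loop has finished, on whatever `row` was last.
    row = [item.strip() for item in row if str(item)]
    return [row]


def strip_inside_loop(rows):
    """Cleaning step indented into the loop: every row is processed."""
    cleaned = []
    for cells in rows:
        row = list(cells)
        row = [item.strip() for item in row if str(item)]
        cleaned.append(row)
    return cleaned


print(strip_outside_loop(raw_rows))
print(strip_inside_loop(raw_rows))
```

The first call returns a single row because the loop variable is overwritten on every pass and the cleaning step only ever sees its final value.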
Edit: to collect all the rows:
all_rows = []
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    all_rows.append(row)
for row in all_rows:
    print(row)
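A related pitfall when collecting rows: a row whose cells are all `th` elements yields an empty list from `tr.find_all('td')`. Here is a minimal sketch of collecting both cell types and skipping empty rows. It assumes a small made-up HTML snippet, not the markup of the FAO page, which is not reproduced here.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not the markup of the FAO page.
html = """
<table>
  <tr><th>unit</th><th>value</th></tr>
  <tr><td> mm day-1 </td><td> 1 </td></tr>
  <tr><td> MJ m-2 day-1 </td><td> 2.45 </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table1 = soup.find("table")

all_rows = []
for tr in table1.find_all("tr"):
    # Passing a list matches either tag name, so header cells are kept too.
    cells = tr.find_all(["td", "th"])
    row = [cell.get_text(strip=True) for cell in cells]
    if row:  # skip rows that contain no cells at all
        all_rows.append(row)

for row in all_rows:
    print(row)
```

Using `get_text(strip=True)` also removes the need for a separate stripping pass over each row.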
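Edit 2: If the end goal is to get the table data into a dataframe, this is a one-liner (it replaces the for-loop approach):
df = pd.read_html(url)[0]  # read_html returns a list of DataFrames, one per table found
You obviously need to import pandas first:
import pandas as pd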
Output:
print(df)
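It looks like the loop had already finished by the time you printed row in the next Jupyter cell. The table is also formatted a bit oddly, so I did this to get the data and column headers as a list of dicts: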
import json
import pprint

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'http://www.fao.org/3/x0490e/x0490e04.htm'
response = requests.get(url)
# no parser is named, so bs4 picks the best one installed (and warns about it)
soup = BeautifulSoup(response.content)
table = soup.find('table')


def clean(text):
    # as written, replace(' ', '') also deletes the spaces inside values such as
    # '1 mm day-1'; the output below keeps them, so the original likely targeted
    # a non-breaking space or a run of spaces
    return text.replace('\r', '').replace('\n', '').replace(' ', '').strip()


# get the column headers
headers = [clean(col.text)
           for col in table.find_all('tr')[1].find_all('td')]
# set the first column to 'name' because it is blank
headers.insert(0, 'name')

# get the data rows and zip them to the column headers
data = [{col[0]: clean(col[1].text)
         for col in zip(headers, row.find_all('td'))}
        for row in table.find_all('tr')[2:]]
data_list = [headers] + [list(row.values()) for row in data]

# print as a list of lists
pprint.pprint(data_list)

# pretty print as JSON
print(json.dumps(data, indent=4))

# print as a dataframe
df = pd.DataFrame(data)
print(df)
Output:
[['name', 'mm day-1', 'm3 ha-1 day-1', 'l s-1 ha-1', 'MJ m-2 day-1'],
['1 mm day-1', '1', '10', '0.116', '2.45'],
['1 m3 ha-1 day-1', '0.1', '1', '0.012', '0.245'],
['1 l s-1 ha-1', '8.640', '86.40', '1', '21.17'],
['1 MJ m-2 day-1', '0.408', '4.082', '0.047', '1']]
[
{
"name": "1 mm day-1",
"mm day-1": "1",
"m3 ha-1 day-1": "10",
"l s-1 ha-1": "0.116",
"MJ m-2 day-1": "2.45"
},
{
"name": "1 m3 ha-1 day-1",
"mm day-1": "0.1",
"m3 ha-1 day-1": "1",
"l s-1 ha-1": "0.012",
"MJ m-2 day-1": "0.245"
},
{
"name": "1 l s-1 ha-1",
"mm day-1": "8.640",
"m3 ha-1 day-1": "86.40",
"l s-1 ha-1": "1",
"MJ m-2 day-1": "21.17"
},
{
"name": "1 MJ m-2 day-1",
"mm day-1": "0.408",
"m3 ha-1 day-1": "4.082",
"l s-1 ha-1": "0.047",
"MJ m-2 day-1": "1"
}
]
              name  mm day-1  m3 ha-1 day-1  l s-1 ha-1  MJ m-2 day-1
0       1 mm day-1         1             10       0.116          2.45
1  1 m3 ha-1 day-1       0.1              1       0.012         0.245
2     1 l s-1 ha-1     8.640          86.40           1         21.17
3   1 MJ m-2 day-1     0.408          4.082       0.047             1
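Every value in `data` is still a string at this point. Below is a minimal sketch of how the header/cell pairing works and how the values could be converted to floats. It uses a hand-copied subset of the output above, and `to_number` is a hypothetical helper rather than part of the answer.

```python
headers = ["name", "mm day-1", "MJ m-2 day-1"]
# Cell text copied from two rows of the output above (subset of columns).
rows = [
    ["1 mm day-1", "1", "2.45"],
    ["1 MJ m-2 day-1", "0.408", "1"],
]


def to_number(text):
    """Return float(text) when possible, otherwise the text unchanged."""
    try:
        return float(text)
    except ValueError:
        return text


# zip() pairs each header with the cell in the same position.
records = [
    {header: to_number(cell) for header, cell in zip(headers, row)}
    for row in rows
]

print(records[0])
```

The row labels such as "1 mm day-1" fail the float conversion and are kept as text, which is the intended fallback here.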
Output of my df:
   MJ m-2 day-1  l s-1 ha-1  m3 ha-1 day-1  mm day-1             name
0          2.45       0.116             10         1       1 mm day-1
1         0.245       0.012              1       0.1  1 m3 ha-1 day-1
2         21.17           1          86.40     8.640     1 l s-1 ha-1
3             1       0.047          4.082     0.408   1 MJ m-2 day-1
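The columns here come out in alphabetical order rather than in the order of the headers. That is consistent with older pandas versions, which sorted the keys when building a DataFrame from a list of dicts. This is an assumption about the environment, since the version is not stated. A minimal sketch of pinning the order with the `columns` argument, using a hand-copied subset of the data:

```python
import pandas as pd

headers = ["name", "mm day-1", "MJ m-2 day-1"]
data = [
    {"name": "1 mm day-1", "mm day-1": "1", "MJ m-2 day-1": "2.45"},
    {"name": "1 MJ m-2 day-1", "mm day-1": "0.408", "MJ m-2 day-1": "1"},
]

# `columns` fixes the column order regardless of how the dict keys are ordered.
df = pd.DataFrame(data, columns=headers)
print(df)
```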