Table data Scraping using BeautifulSoup in Python
Why am I not getting all the rows when extracting table data using BeautifulSoup in Python?
Link to the website - http://www.fao.org/3/x0490e/x0490e04.htm
table1_rows = table1.find_all('tr')
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
Output of the above code:
print(row)
row = [item.strip() for item in row if str(item)]
row
But I'm getting this output:
After making some changes:
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    print(row)
I'm still not getting the output. Can anyone please help me? When I print the row variable outside the loop, I don't get the output either.
This line:
row = [item.strip() for item in row if str(item)]
should sit inside the for tr in table1_rows: loop:
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    print(row)
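The reason printing row after the loop shows only one result is that a loop variable keeps whatever it was last assigned; each iteration overwrites the previous value. A minimal sketch with plain lists standing in for the scraped rows (hypothetical data, not the FAO table):

```python
# Plain lists standing in for the scraped table rows (hypothetical data)
table1_rows = [[' a ', ' 1 '], [' b ', ' 2 '], [' c ', ' 3 ']]
for tr in table1_rows:
    # row is rebound on every iteration
    row = [item.strip() for item in tr]
# outside the loop, only the last assignment survives
print(row)  # ['c', '3']
```

This is exactly why any per-row work (stripping, printing) has to happen inside the loop body.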
Edit: To collect all rows:
all_rows = []
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    all_rows.append(row)
for row in all_rows:
    print(row)
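One common source of empty lists in results like this: header rows whose cells are <th> rather than <td> produce nothing from find_all('td'). Those rows can be filtered while collecting. A self-contained sketch (the inline markup is hypothetical, standing in for the FAO page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the FAO table:
# the header row uses <th>, so find_all('td') returns nothing for it
html = ("<table><tr><th>unit</th><th>value</th></tr>"
        "<tr><td>1 mm day-1</td><td>1</td></tr></table>")
table1 = BeautifulSoup(html, 'html.parser').find('table')

all_rows = []
for tr in table1.find_all('tr'):
    row = [td.text.strip() for td in tr.find_all('td')]
    if row:  # skip rows with no <td> cells, e.g. <th>-only header rows
        all_rows.append(row)
print(all_rows)
```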
Edit 2: If the ultimate aim is to get the table data into a dataframe, that's a one-liner (this replaces the for-loop approach):
df = pd.read_html(url)[0]
You obviously need to import pandas first:
import pandas as pd
Output:
print(df)
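pd.read_html parses every &lt;table&gt; it finds and returns a list of DataFrames, which is why the [0] index is needed. It also accepts a file-like object, so the behaviour can be sketched without hitting the network (the inline markup below is hypothetical, not the FAO page; newer pandas versions want literal HTML wrapped in StringIO):

```python
import pandas as pd
from io import StringIO

# Hypothetical markup; read_html returns one DataFrame per <table> found
html = StringIO("""
<table>
  <tr><th>name</th><th>mm day-1</th></tr>
  <tr><td>1 mm day-1</td><td>1</td></tr>
  <tr><td>1 l s-1 ha-1</td><td>8.640</td></tr>
</table>
""")
df = pd.read_html(html)[0]  # [0] picks the first table on the page
print(df)
```

Note that read_html needs an HTML parser backend installed (lxml, or html5lib plus BeautifulSoup).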
It looks like you are already past the end of the loop by the time you print in the next Jupyter cell. That table is also somewhat oddly formatted, so I put this together to get the data and column headers as a list of nested dicts:
import requests
import pandas as pd
import pprint
import json
from bs4 import BeautifulSoup

url = 'http://www.fao.org/3/x0490e/x0490e04.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table')

def clean(text):
    # strip line breaks and non-breaking spaces from the cell text
    return text.replace('\r', '').replace('\n', '').replace('\xa0', '').strip()

# get the column headers
headers = [clean(col.text)
           for col in table.find_all('tr')[1].find_all('td')]
# set the first column to 'name' because it is blank
headers.insert(0, 'name')

# get the data rows and zip them to the column headers
data = [{col[0]: clean(col[1].text)
         for col in zip(headers, row.find_all('td'))}
        for row in table.find_all('tr')[2:]]
data_list = [headers] + [list(row.values()) for row in data]

# print as a list of lists
pprint.pprint(data_list)

# pretty print as json
print(json.dumps(data, indent=4))

# print as a dataframe
df = pd.DataFrame(data)
print(df)
Output:
[['name', 'mm day-1', 'm3 ha-1 day-1', 'l s-1 ha-1', 'MJ m-2 day-1'],
['1 mm day-1', '1', '10', '0.116', '2.45'],
['1 m3 ha-1 day-1', '0.1', '1', '0.012', '0.245'],
['1 l s-1 ha-1', '8.640', '86.40', '1', '21.17'],
['1 MJ m-2 day-1', '0.408', '4.082', '0.047', '1']]
[
{
"name": "1 mm day-1",
"mm day-1": "1",
"m3 ha-1 day-1": "10",
"l s-1 ha-1": "0.116",
"MJ m-2 day-1": "2.45"
},
{
"name": "1 m3 ha-1 day-1",
"mm day-1": "0.1",
"m3 ha-1 day-1": "1",
"l s-1 ha-1": "0.012",
"MJ m-2 day-1": "0.245"
},
{
"name": "1 l s-1 ha-1",
"mm day-1": "8.640",
"m3 ha-1 day-1": "86.40",
"l s-1 ha-1": "1",
"MJ m-2 day-1": "21.17"
},
{
"name": "1 MJ m-2 day-1",
"mm day-1": "0.408",
"m3 ha-1 day-1": "4.082",
"l s-1 ha-1": "0.047",
"MJ m-2 day-1": "1"
}
]
name mm day-1 m3 ha-1 day-1 l s-1 ha-1 MJ m-2 day-1
0 1 mm day-1 1 10 0.116 2.45
1 1 m3 ha-1 day-1 0.1 1 0.012 0.245
2 1 l s-1 ha-1 8.640 86.40 1 21.17
3 1 MJ m-2 day-1 0.408 4.082 0.047 1
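The zip step in the code above pairs each header with the cell in the same position and stops at the shorter of the two sequences, which is what turns a row into a record. The pattern in isolation, with made-up values:

```python
headers = ['name', 'mm day-1', 'm3 ha-1 day-1']
cells = ['1 mm day-1', '1', '10']
# zip pairs each header with the cell at the same index;
# dict() turns the pairs into a single record
record = dict(zip(headers, cells))
print(record)  # {'name': '1 mm day-1', 'mm day-1': '1', 'm3 ha-1 day-1': '10'}
```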
My output of df:
MJ m-2 day-1 l s-1 ha-1 m3 ha-1 day-1 mm day-1 name
0 2.45 0.116 10 1 1 mm day-1
1 0.245 0.012 1 0.1 1 m3 ha-1 day-1
2 21.17 1 86.40 8.640 1 l s-1 ha-1
3 1 0.047 4.082 0.408 1 MJ m-2 day-1
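The different column order likely comes from an older pandas version sorting dict keys when building the DataFrame (or a pre-3.7 Python, where dicts did not preserve insertion order). Passing columns= pins the order explicitly regardless of environment. A small sketch reusing the header names from above with one made-up row:

```python
import pandas as pd

headers = ['name', 'mm day-1', 'm3 ha-1 day-1', 'l s-1 ha-1', 'MJ m-2 day-1']
data = [{'name': '1 mm day-1', 'mm day-1': '1', 'm3 ha-1 day-1': '10',
         'l s-1 ha-1': '0.116', 'MJ m-2 day-1': '2.45'}]
# columns= pins the column order regardless of dict-key ordering behaviour
df = pd.DataFrame(data, columns=headers)
print(list(df.columns))
```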