Scraping table data with BeautifulSoup in Python

Question

Why am I not getting all the rows when I extract table data with BeautifulSoup in Python?

Link to the website - http://www.fao.org/3/x0490e/x0490e04.htm

table1_rows = table1.find_all('tr')

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

Output of the code above

print(row)
row = [item.strip() for item in row if str(item)]
row

But I got this output

After making some changes

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
row = [item.strip() for item in row if str(item)]
print(row)

Then I still get no output. Can anyone help me? Why do I get no output when I print the row variable outside the loop?

Output

Answer 1

This line:

row = [item.strip() for item in row if str(item)]

should be inside the for tr in table1_rows loop:

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    print(row)
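To see why the placement matters, here is a minimal sketch that uses plain lists in place of the parsed cells. The sample values and the names table_rows, cleaned_outside and cleaned_inside are made up for illustration; they are not from the original page or from BeautifulSoup.

```python
# Made-up cell texts standing in for what tr.find_all('td') would yield.
table_rows = [
    [" 1 mm day-1 ", " 1 ", " 10 "],
    [" 1 m3 ha-1 day-1 ", " 0.1 ", " 1 "],
    [" 1 l s-1 ha-1 ", " 8.640 ", " 86.40 "],
]

# Cleaning step placed after the loop: `row` still refers only to the
# last row, so a single row gets cleaned.
for cells in table_rows:
    row = list(cells)
cleaned_outside = [item.strip() for item in row if item]

# Cleaning step placed inside the loop: every row gets cleaned.
cleaned_inside = []
for cells in table_rows:
    row = [item.strip() for item in cells if item]
    cleaned_inside.append(row)

print(cleaned_outside)      # only the last row
print(len(cleaned_inside))  # one entry per table row
```

After a loop finishes, the loop variables keep whatever they held on the final iteration, which is why code in a later Jupyter cell only ever sees the last row.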

Edit: to collect all the rows:

all_rows=[]
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    all_rows.append(row)

for row in all_rows:
    print(row)

Edit 2: if the end goal is to put the table data into a dataframe, that is a one-liner (it replaces the for-loop approach):

df=pd.read_html(url)[0]

You will obviously need to import pandas first:

import pandas as pd

Output:

print(df)

Answer 2

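It looks like the loop has already finished by the time you are in the next Jupyter cell, so row only holds the last row. The table is also formatted a bit oddly, so I did this to get the data and column headers as a list of dicts: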

import requests
import pandas as pd
import pprint
from bs4 import BeautifulSoup


url = 'http://www.fao.org/3/x0490e/x0490e04.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')  # name the parser explicitly to avoid the bs4 warning

table = soup.find('table')

def clean(text):
    return text.replace('\r', '').replace('\n', '').replace('  ', '').strip()

# get the column headers
headers = [clean(col.text)
           for col in table.find_all('tr')[1].find_all('td')]
# set the first column to 'name' because it is blank
headers.insert(0, 'name') 

# get the data rows and zip them to the column headers
data = [{col[0]: clean(col[1].text)
         for col in zip(headers, row.find_all('td'))}
        for row in table.find_all('tr')[2::]]

data_list = [headers] + [list(row.values()) for row in data]

# print to list of lists
pprint.pprint(data_list)
# pretty print to json
import json
print(json.dumps(data, indent=4))
# print to dataframe
df = pd.DataFrame(data)
print(df)

Output:

[['name', 'mm day-1', 'm3 ha-1 day-1', 'l s-1 ha-1', 'MJ m-2 day-1'],
 ['1 mm day-1', '1', '10', '0.116', '2.45'],
 ['1 m3 ha-1 day-1', '0.1', '1', '0.012', '0.245'],
 ['1 l s-1 ha-1', '8.640', '86.40', '1', '21.17'],
 ['1 MJ m-2 day-1', '0.408', '4.082', '0.047', '1']]
[
    {
        "name": "1 mm day-1",
        "mm day-1": "1",
        "m3 ha-1 day-1": "10",
        "l s-1 ha-1": "0.116",
        "MJ m-2 day-1": "2.45"
    },
    {
        "name": "1 m3 ha-1 day-1",
        "mm day-1": "0.1",
        "m3 ha-1 day-1": "1",
        "l s-1 ha-1": "0.012",
        "MJ m-2 day-1": "0.245"
    },
    {
        "name": "1 l s-1 ha-1",
        "mm day-1": "8.640",
        "m3 ha-1 day-1": "86.40",
        "l s-1 ha-1": "1",
        "MJ m-2 day-1": "21.17"
    },
    {
        "name": "1 MJ m-2 day-1",
        "mm day-1": "0.408",
        "m3 ha-1 day-1": "4.082",
        "l s-1 ha-1": "0.047",
        "MJ m-2 day-1": "1"
    }
]
              name mm day-1 m3 ha-1 day-1 l s-1 ha-1 MJ m-2 day-1
0       1 mm day-1        1            10      0.116         2.45
1  1 m3 ha-1 day-1      0.1             1      0.012        0.245
2     1 l s-1 ha-1    8.640         86.40          1        21.17
3   1 MJ m-2 day-1    0.408         4.082      0.047            1

The output of my df:

     MJ m-2 day-1 l s-1 ha-1 m3 ha-1 day-1 mm day-1             name
    0         2.45      0.116            10        1       1 mm day-1
    1        0.245      0.012             1      0.1  1 m3 ha-1 day-1
    2        21.17          1         86.40    8.640     1 l s-1 ha-1
    3            1      0.047         4.082    0.408   1 MJ m-2 day-1
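The two dataframe printouts above hold the same values but list the columns in a different order. One likely cause, which is an assumption since the thread does not confirm it, is an older Python or pandas version in which dict keys were not kept in insertion order. The sketch below shows the header-zipping idea with made-up sample values (raw_rows is a hypothetical stand-in for the parsed cells) and rebuilds each row by looking up the headers explicitly, so the order never depends on the dict.

```python
# Made-up header and cell values, standing in for the parsed table.
headers = ["name", "mm day-1", "m3 ha-1 day-1"]
raw_rows = [
    ["1 mm day-1", "1", "10"],
    ["1 m3 ha-1 day-1", "0.1", "1"],
]

# Pair each cell with its column header, one dict per row.
data = [dict(zip(headers, cells)) for cells in raw_rows]

# Rebuild the rows by looking up each header explicitly, so the column
# order follows `headers` rather than the dict's own ordering.
data_list = [headers] + [[record[h] for h in headers] for record in data]

for line in data_list:
    print(line)
```

With pandas, the same effect can be had by passing the header list when building the frame, for example pd.DataFrame(data, columns=headers).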


Answer 1
0 accepted 2020-05-20 13:50:07

Answer 2
0 2020-05-20 14:03:46
