Table data scraping with BeautifulSoup in Python

Question

Why am I not getting all the rows when extracting table data with BeautifulSoup in Python?

Link to website - http://www.fao.org/3/x0490e/x0490e04.htm

table1_rows = table1.find_all('tr')

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

Output of the above code:

print(row)
row = [item.strip() for item in row if str(item)]
row

But I'm getting this output:

After making some changes:

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    print(row)

Even then I'm not getting the output. Can anyone please help me? When I print the row variable outside the loop, I'm not getting the output.

Output:

Answer 1

This line:

row = [item.strip() for item in row if str(item)]

should sit inside the for tr in table1_rows loop:

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    print(row)
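To illustrate why the placement matters, here is a minimal sketch that involves no scraping at all. The nested list raw_rows is a made-up stand-in for the cell texts of three table rows; it is not data from the page.

```python
# Hypothetical cell texts for three table rows (illustration only).
raw_rows = [[" a ", " b "], [" c ", " d "], [" e ", " f "]]

# Cleaning AFTER the loop: `row` is rebound on every iteration,
# so once the loop ends it refers only to the last row.
for cells in raw_rows:
    row = list(cells)
last_only = [item.strip() for item in row if str(item)]

# Cleaning INSIDE the loop: every row is processed.
cleaned_rows = []
for cells in raw_rows:
    row = [item.strip() for item in cells if str(item)]
    cleaned_rows.append(row)

print(last_only)
print(cleaned_rows)
```

The first result contains one row, the second contains all three.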

Edit: To collect all rows:

all_rows=[]
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    all_rows.append(row)

for row in all_rows:
    print(row)
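One thing to watch with this approach: a row made up only of th cells has no td elements, so it ends up as an empty list. The sketch below shows one possible way to handle that, by matching both tag names and skipping empty rows. The inline HTML is a made-up example, not the FAO page, and html.parser is chosen here only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical table used for illustration only.
html = """
<table>
  <tr><th>unit</th><th>value</th></tr>
  <tr><td> a </td><td> 1 </td></tr>
  <tr><td> b </td><td> 2 </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

all_rows = []
for tr in table.find_all("tr"):
    # Passing a list to find_all matches any of the given tag names.
    cells = tr.find_all(["td", "th"])
    # get_text(strip=True) trims surrounding whitespace from each cell.
    row = [cell.get_text(strip=True) for cell in cells]
    if row:  # skip rows that contain no cells at all
        all_rows.append(row)

for row in all_rows:
    print(row)
```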

Edit 2: If the ultimate aim is to get the table data into a dataframe, then that's a one-liner (this replaces the for loop approach):

df=pd.read_html(url)[0]

You need to import pandas first:

import pandas as pd

Output:

print(df)
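For context, pd.read_html returns a list with one DataFrame per table it finds, which is why the one-liner indexes with [0]. In that one-liner, url is presumably the page address from the question. The sketch below runs the same call on a made-up inline table so it works offline; it assumes a parser backend that pandas can use (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table used for illustration only.
html = """
<table>
  <tr><th>unit</th><th>value</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```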

Answer 2

It looks like the loop has already finished by the time you reach the next Jupyter cell, so row only holds the last row. That table is also formatted a bit oddly, so I wrote this to get the data and column headers as a list of dicts:

import requests
import pandas as pd
import pprint
from bs4 import BeautifulSoup


url = 'http://www.fao.org/3/x0490e/x0490e04.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content)

table = soup.find('table')

def clean(text):
    return text.replace('\r', '').replace('\n', '').replace('  ', '').strip()

# get the column headers
headers = [clean(col.text)
           for col in table.find_all('tr')[1].find_all('td')]
# set the first column to 'name' because it is blank
headers.insert(0, 'name') 

# get the data rows and zip them to the column headers
data = [{col[0]: clean(col[1].text)
         for col in zip(headers, row.find_all('td'))}
        for row in table.find_all('tr')[2::]]

data_list = [headers] + [list(row.values()) for row in data]

# print to list of lists
pprint.pprint(data_list)
# pretty print to json
import json
print(json.dumps(data, indent=4))
# print to dataframe
df = pd.DataFrame(data)
print(df)

Output:

[['name', 'mm day-1', 'm3 ha-1 day-1', 'l s-1 ha-1', 'MJ m-2 day-1'],
 ['1 mm day-1', '1', '10', '0.116', '2.45'],
 ['1 m3 ha-1 day-1', '0.1', '1', '0.012', '0.245'],
 ['1 l s-1 ha-1', '8.640', '86.40', '1', '21.17'],
 ['1 MJ m-2 day-1', '0.408', '4.082', '0.047', '1']]
[
    {
        "name": "1 mm day-1",
        "mm day-1": "1",
        "m3 ha-1 day-1": "10",
        "l s-1 ha-1": "0.116",
        "MJ m-2 day-1": "2.45"
    },
    {
        "name": "1 m3 ha-1 day-1",
        "mm day-1": "0.1",
        "m3 ha-1 day-1": "1",
        "l s-1 ha-1": "0.012",
        "MJ m-2 day-1": "0.245"
    },
    {
        "name": "1 l s-1 ha-1",
        "mm day-1": "8.640",
        "m3 ha-1 day-1": "86.40",
        "l s-1 ha-1": "1",
        "MJ m-2 day-1": "21.17"
    },
    {
        "name": "1 MJ m-2 day-1",
        "mm day-1": "0.408",
        "m3 ha-1 day-1": "4.082",
        "l s-1 ha-1": "0.047",
        "MJ m-2 day-1": "1"
    }
]
              name mm day-1 m3 ha-1 day-1 l s-1 ha-1 MJ m-2 day-1
0       1 mm day-1        1            10      0.116         2.45
1  1 m3 ha-1 day-1      0.1             1      0.012        0.245
2     1 l s-1 ha-1    8.640         86.40          1        21.17
3   1 MJ m-2 day-1    0.408         4.082      0.047            1
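The core of this answer is pairing each cell with its column header via zip. Here is a small sketch of just that step, without any network access. The clean function is copied from the answer; raw_rows is a made-up stand-in for the cell texts, with stray whitespace added to show what clean removes.

```python
def clean(text):
    # Same helper as in the answer: drop line breaks and double spaces.
    return text.replace('\r', '').replace('\n', '').replace('  ', '').strip()

headers = ["name", "mm day-1", "m3 ha-1 day-1"]

# Hypothetical raw cell texts (illustration only).
raw_rows = [
    ["1 mm day-1\r\n", " 1", "10 "],
    ["1 m3 ha-1 day-1", "0.1\n", "1"],
]

# zip pairs each header with the cell in the same position.
data = [
    {key: clean(value) for key, value in zip(headers, cells)}
    for cells in raw_rows
]

# Header row followed by the cleaned values of each data row.
data_list = [headers] + [list(row.values()) for row in data]
print(data_list)
```

Note that zip stops at the shorter of its two inputs, so a row with fewer cells than headers simply produces a dict with fewer keys.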

My output of df:

     MJ m-2 day-1 l s-1 ha-1 m3 ha-1 day-1 mm day-1             name
    0         2.45      0.116            10        1       1 mm day-1
    1        0.245      0.012             1      0.1  1 m3 ha-1 day-1
    2        21.17          1         86.40    8.640     1 l s-1 ha-1
    3            1      0.047         4.082    0.408   1 MJ m-2 day-1
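The columns above come out in a different order than in the answer. The thread does not say why; one plausible cause (an assumption) is an older Python or pandas version that did not keep dict keys in insertion order. Whatever the cause, passing the header list as columns pins the order explicitly. A minimal sketch with a subset of the values shown above:

```python
import pandas as pd

headers = ["name", "mm day-1", "m3 ha-1 day-1"]
data = [
    {"name": "1 mm day-1", "mm day-1": "1", "m3 ha-1 day-1": "10"},
    {"name": "1 m3 ha-1 day-1", "mm day-1": "0.1", "m3 ha-1 day-1": "1"},
]

# columns= fixes the column order regardless of dict key order.
df = pd.DataFrame(data, columns=headers)
print(df)
```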

Answer 1
0 accepted 2020-05-20 13:50:07

Answer 2
0 2020-05-20 14:03:46
