
Table data scraping using BeautifulSoup in Python

Why am I not getting all the rows when I extract table data with BeautifulSoup in Python?

Link to the website - http://www.fao.org/3/x0490e/x0490e04.htm

Code:

table1_rows = table1.find_all('tr')

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

Output of the above code:

print(row)
row = [item.strip() for item in row if str(item)]
row

But I got this output.

After making some changes:

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    print(row)

Then I got no output either. Can anyone help me? Why do I get no output when I print the row variable from the loop?

Output

This line:

row = [item.strip() for item in row if str(item)]

needs to be indented so that it runs inside the for tr in table1_rows loop:

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    print(row)
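
To see why the indentation matters, here is a minimal sketch on a small inline table. The HTML string and the names html, last_only, and cleaned are made up for illustration, and the real page is not fetched. Once a for loop has finished, row still holds only the last row, so stripping it outside the loop cleans just that one row:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page, so the example runs offline.
html = """
<table>
  <tr><td> a </td><td> 1 </td></tr>
  <tr><td> b </td><td> 2 </td></tr>
  <tr><td> c </td><td> 3 </td></tr>
</table>
"""
table1 = BeautifulSoup(html, "html.parser").find("table")
table1_rows = table1.find_all("tr")

# Strip outside the loop: `row` is whatever the last iteration left behind.
for tr in table1_rows:
    row = [td.text for td in tr.find_all("td")]
last_only = [item.strip() for item in row if item]

# Strip inside the loop: every row is cleaned and kept.
cleaned = []
for tr in table1_rows:
    row = [td.text for td in tr.find_all("td")]
    cleaned.append([item.strip() for item in row if item])

print(last_only)  # ['c', '3']
print(cleaned)    # [['a', '1'], ['b', '2'], ['c', '3']]
```

The same thing happens when the strip line is run in a separate notebook cell after the loop: only the final row is processed.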

Edit: to collect all the rows:

all_rows=[]
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    all_rows.append(row)

for row in all_rows:
    print(row)

Edit 2: if the end goal is to get the table data into a dataframe, that is a one-line job (this replaces the for-loop approach):

df=pd.read_html(url)[0]

You obviously need to import pandas first:

import pandas as pd

Output:

print(df)
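
pd.read_html returns a list with one DataFrame per table it finds, which is why the line above indexes the result with [0]. Here is a minimal sketch, assuming a parser backend such as lxml is installed. The inline HTML string is a made-up stand-in for the page, and it is wrapped in StringIO because recent pandas versions expect a file-like object rather than a literal HTML string:

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row table; the real page is not fetched here.
html = """
<table>
  <tr><th>unit</th><th>value</th></tr>
  <tr><td>mm day-1</td><td>1</td></tr>
  <tr><td>m3 ha-1 day-1</td><td>10</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # list of DataFrames, one per <table>
df = tables[0]                         # pick the first (and here only) table

print(len(tables))
print(df)
```

On a page with several tables, the index selects which one you get, so [0] is only right when the table you want comes first.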


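It looks like the loop had already finished by the time you ran the next Jupyter cell. The table is also formatted a bit oddly, so I wrote this to get the data and the column headers as a list of dicts: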

import requests
import pandas as pd
import pprint
from bs4 import BeautifulSoup


url = 'http://www.fao.org/3/x0490e/x0490e04.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content)

table = soup.find('table')

def clean(text):
    return text.replace('\r', '').replace('\n', '').replace('  ', '').strip()

# get the column headers
headers = [clean(col.text)
           for col in table.find_all('tr')[1].find_all('td')]
# set the first column to 'name' because it is blank
headers.insert(0, 'name') 

# get the data rows and zip them to the column headers
data = [{col[0]: clean(col[1].text)
         for col in zip(headers, row.find_all('td'))}
        for row in table.find_all('tr')[2::]]

data_list = [headers] + [list(row.values()) for row in data]

# print to list of lists
pprint.pprint(data_list)
# pretty print to json
import json
print(json.dumps(data, indent=4))
# print to dataframe
df = pd.DataFrame(data)
print(df)

Output:

[['name', 'mm day-1', 'm3 ha-1 day-1', 'l s-1 ha-1', 'MJ m-2 day-1'],
 ['1 mm day-1', '1', '10', '0.116', '2.45'],
 ['1 m3 ha-1 day-1', '0.1', '1', '0.012', '0.245'],
 ['1 l s-1 ha-1', '8.640', '86.40', '1', '21.17'],
 ['1 MJ m-2 day-1', '0.408', '4.082', '0.047', '1']]
[
    {
        "name": "1 mm day-1",
        "mm day-1": "1",
        "m3 ha-1 day-1": "10",
        "l s-1 ha-1": "0.116",
        "MJ m-2 day-1": "2.45"
    },
    {
        "name": "1 m3 ha-1 day-1",
        "mm day-1": "0.1",
        "m3 ha-1 day-1": "1",
        "l s-1 ha-1": "0.012",
        "MJ m-2 day-1": "0.245"
    },
    {
        "name": "1 l s-1 ha-1",
        "mm day-1": "8.640",
        "m3 ha-1 day-1": "86.40",
        "l s-1 ha-1": "1",
        "MJ m-2 day-1": "21.17"
    },
    {
        "name": "1 MJ m-2 day-1",
        "mm day-1": "0.408",
        "m3 ha-1 day-1": "4.082",
        "l s-1 ha-1": "0.047",
        "MJ m-2 day-1": "1"
    }
]
              name mm day-1 m3 ha-1 day-1 l s-1 ha-1 MJ m-2 day-1
0       1 mm day-1        1            10      0.116         2.45
1  1 m3 ha-1 day-1      0.1             1      0.012        0.245
2     1 l s-1 ha-1    8.640         86.40          1        21.17
3   1 MJ m-2 day-1    0.408         4.082      0.047            1

The output of my df:

     MJ m-2 day-1 l s-1 ha-1 m3 ha-1 day-1 mm day-1             name
    0         2.45      0.116            10        1       1 mm day-1
    1        0.245      0.012             1      0.1  1 m3 ha-1 day-1
    2        21.17          1         86.40    8.640     1 l s-1 ha-1
    3            1      0.047         4.082    0.408   1 MJ m-2 day-1
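
The columns in this output are in alphabetical order rather than table order. That is probably because older Python and pandas versions did not preserve dict key order when building a frame from a list of dicts (an assumption, since the versions in use were not stated). Passing the header list through the columns argument of pd.DataFrame sets the order explicitly. A small sketch with two rows copied from the output above:

```python
import pandas as pd

headers = ['name', 'mm day-1', 'm3 ha-1 day-1', 'l s-1 ha-1', 'MJ m-2 day-1']

# Two rows from the answer's data, with keys deliberately out of table order.
data = [
    {'MJ m-2 day-1': '2.45', 'l s-1 ha-1': '0.116',
     'm3 ha-1 day-1': '10', 'mm day-1': '1', 'name': '1 mm day-1'},
    {'MJ m-2 day-1': '0.245', 'l s-1 ha-1': '0.012',
     'm3 ha-1 day-1': '1', 'mm day-1': '0.1', 'name': '1 m3 ha-1 day-1'},
]

# `columns` fixes the column order regardless of the key order in each dict.
df = pd.DataFrame(data, columns=headers)
print(df)
```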
