简体   繁体   中英

Table data Scraping using BeautifulSoup in Python

Code

Why i'm not getting the all rows while extracting the table data using BeautifulSoup in Python?

Link to website - http://www.fao.org/3/x0490e/x0490e04.htm

table1_rows = table1.find_all('tr')

for tr in table1_rows:
td = tr.find_all('td')
row = [i.text for i in td]
print(row)

Output of above code

print(row)
row = [item.strip() for item in row if str(item)]
row

But i'm getting this output

After doing some changes

for tr in table1_rows:
td = tr.find_all('td')
row = [i.text for i in td]
row = [item.strip() for item in row if str(item)]
print(row)

Then also i'm not getting the output. Can anyone please help me? When i'm printing the row variable out of the loop then i'm not getting the output?

Output

This line:

row = [item.strip() for item in row if str(item)]

should sit inside the for tr in table1_rows loop:

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    print(row)

Edit: To collect all rows:

all_rows=[]
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    all_rows.append(row)

for row in all_rows:
    print(row)

Edit 2: If the ultimate aim is to get the table data into a dataframe, then that's a one liner job (this replaces the for loop approach):

df=pd.read_html(url)[0]

You obviously need to import pandas first:

import pandas as pd

Output:

print(df)

在此处输入图像描述

It looks like you are at the end of the loop when in the next jupyter block. That table is kinda weirdly formatted too so I went and made this to get the data and column headers as a nested dict list:

import requests
import pandas as pd
import pprint
from bs4 import BeautifulSoup


url = 'http://www.fao.org/3/x0490e/x0490e04.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content)

table = soup.find('table')

def clean(text):
    return text.replace('\r', '').replace('\n', '').replace('  ', '').strip()

# get the column headers
headers = [clean(col.text)
           for col in table.find_all('tr')[1].find_all('td')]
# set the first column to 'name' because it is blank
headers.insert(0, 'name') 

# get the data rows and zip them to the column headers
data = [{col[0]: clean(col[1].text)
         for col in zip(headers, row.find_all('td'))}
        for row in table.find_all('tr')[2::]]

data_list = [headers] + [list(row.values()) for row in data]

# print to list of lists
pprint.pprint(data_list)
# pretty print to json
import json
print(json.dumps(data, indent=4))
# print to dataframe
df = pd.DataFrame(data)
print(df)

Output:

[['name', 'mm day-1', 'm3 ha-1 day-1', 'l s-1 ha-1', 'MJ m-2 day-1'],
 ['1 mm day-1', '1', '10', '0.116', '2.45'],
 ['1 m3 ha-1 day-1', '0.1', '1', '0.012', '0.245'],
 ['1 l s-1 ha-1', '8.640', '86.40', '1', '21.17'],
 ['1 MJ m-2 day-1', '0.408', '4.082', '0.047', '1']]
[
    {
        "name": "1 mm day-1",
        "mm day-1": "1",
        "m3 ha-1 day-1": "10",
        "l s-1 ha-1": "0.116",
        "MJ m-2 day-1": "2.45"
    },
    {
        "name": "1 m3 ha-1 day-1",
        "mm day-1": "0.1",
        "m3 ha-1 day-1": "1",
        "l s-1 ha-1": "0.012",
        "MJ m-2 day-1": "0.245"
    },
    {
        "name": "1 l s-1 ha-1",
        "mm day-1": "8.640",
        "m3 ha-1 day-1": "86.40",
        "l s-1 ha-1": "1",
        "MJ m-2 day-1": "21.17"
    },
    {
        "name": "1 MJ m-2 day-1",
        "mm day-1": "0.408",
        "m3 ha-1 day-1": "4.082",
        "l s-1 ha-1": "0.047",
        "MJ m-2 day-1": "1"
    }
]
              name mm day-1 m3 ha-1 day-1 l s-1 ha-1 MJ m-2 day-1
0       1 mm day-1        1            10      0.116         2.45
1  1 m3 ha-1 day-1      0.1             1      0.012        0.245
2     1 l s-1 ha-1    8.640         86.40          1        21.17
3   1 MJ m-2 day-1    0.408         4.082      0.047            1

My output of df

     MJ m-2 day-1 l s-1 ha-1 m3 ha-1 day-1 mm day-1             name
    0         2.45      0.116            10        1       1 mm day-1
    1        0.245      0.012             1      0.1  1 m3 ha-1 day-1
    2        21.17          1         86.40    8.640     1 l s-1 ha-1
    3            1      0.047         4.082    0.408   1 MJ m-2 day-1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM