
How to extract a table from a website using BeautifulSoup?

I want to extract the FIPS code for each county in Louisiana from this website using BeautifulSoup and load the result into a pandas DataFrame: https://www.nrcs.usda.gov/wps/portal/nrcs/detail/la/technical/cp/?cid=nrcs143_013697

The columns would be FIPS, Name, and State. After inspecting the page, I've tried searching by tr, td, and table, but I don't know how to single out just the main data and then put it into a pandas DataFrame. Once I find the specific table, it should be easy to do something like:

if state == 'LA':
    # put data into a dataframe

import requests
from bs4 import BeautifulSoup

url = "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/la/technical/cp/?cid=nrcs143_013697"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
# print(soup)
for county in soup.find_all('table'):
    print(county.text)

You can select the <table> with class="data" and then pass it to pd.read_html. For example:

import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

url = "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/la/technical/cp/?cid=nrcs143_013697"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
# wrap the HTML in StringIO: passing a literal HTML string to read_html is deprecated in pandas 2.1+
df = pd.read_html(StringIO(str(soup.select_one(".data"))))[0]
# filter State == 'LA'
print(df[df.State == "LA"].head())

Prints:

       FIPS        Name State
1109  22001      Acadia    LA
1110  22003       Allen    LA
1111  22005   Ascension    LA
1112  22007  Assumption    LA
1113  22009   Avoyelles    LA
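
One thing to watch with pd.read_html is that it infers column types, so FIPS codes that start with a zero (Alabama's, for example) can lose the leading zero when parsed as integers. Below is a minimal sketch of one way around that, assuming the table keeps the FIPS / Name / State header. SAMPLE_HTML is a made-up stand-in for the real page so the snippet runs without a network call, and read_html needs lxml (or bs4 with html5lib) installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page's <table class="data">; not the real data.
SAMPLE_HTML = """
<table class="data">
  <tr><th>FIPS</th><th>Name</th><th>State</th></tr>
  <tr><td>01001</td><td>Autauga</td><td>AL</td></tr>
  <tr><td>22001</td><td>Acadia</td><td>LA</td></tr>
  <tr><td>22003</td><td>Allen</td><td>LA</td></tr>
</table>
"""

# attrs narrows the search to tables with class="data";
# converters keeps each FIPS cell as text instead of letting pandas cast it to int.
tables = pd.read_html(
    StringIO(SAMPLE_HTML),
    attrs={"class": "data"},
    converters={"FIPS": str},
)
df = tables[0]

# Keep only the Louisiana rows.
la = df[df["State"] == "LA"]
print(la)
```

For Louisiana alone the codes all start with 22, so the conversion only matters if you keep the other states as well.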

The page has a single data table, so you can iterate over the <tr> elements in that one table.

If you want the data frame to include only one particular state, you can either filter the rows before adding them to the data frame, or build a data frame of all the data and then filter it to get a subset.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/la/technical/cp/?cid=nrcs143_013697"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
data = []
for tr in soup.find('table', class_='data').find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    # To keep only the LA rows, filter here
    if len(row) == 3 and row[2] == 'LA':
        data.append(row)
df = pd.DataFrame(data, columns=['FIPS', 'Name', 'State'])
print(df)

Output:

     FIPS          Name State
0   22001        Acadia    LA
1   22003         Allen    LA
2   22005     Ascension    LA
3   22007    Assumption    LA
4   22009     Avoyelles    LA
..    ...           ...   ...
63  22127          Winn    LA
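
The other option mentioned above, building a data frame of every row first and filtering afterwards, could look roughly like this. It is a sketch rather than a tested scraper: SAMPLE_HTML is a made-up stand-in for the real page, and table_to_frame is a hypothetical helper name. Swap in the downloaded html_text to run it against the site.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's <table class="data">; not the real data.
SAMPLE_HTML = """
<table class="data">
  <tr><th>FIPS</th><th>Name</th><th>State</th></tr>
  <tr><td>01001</td><td>Autauga</td><td>AL</td></tr>
  <tr><td>22001</td><td>Acadia</td><td>LA</td></tr>
  <tr><td>22003</td><td>Allen</td><td>LA</td></tr>
</table>
"""


def table_to_frame(html):
    """Return every 3-cell row of the class="data" table as a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="data")
    if table is None:
        raise ValueError("no <table class='data'> found in the HTML")
    rows = []
    for tr in table.find_all("tr"):
        # get_text(strip=True) drops surrounding whitespace and newlines
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 3:  # the header row has <th> cells only, so it is skipped
            rows.append(cells)
    return pd.DataFrame(rows, columns=["FIPS", "Name", "State"])


all_counties = table_to_frame(SAMPLE_HTML)

# Filter the full frame afterwards; reset_index renumbers the subset from 0.
louisiana = all_counties[all_counties["State"] == "LA"].reset_index(drop=True)
print(louisiana)
```

Keeping the full frame around is handy if you later need other states too, and because the cells are never converted to numbers, leading zeros in the FIPS codes are preserved.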
