
Scraping HTML tables to CSVs using BS4 for use with Pandas

I have begun a pet project creating what is essentially an indexed compilation of NFL statistics with a nice simple GUI. Fortunately, the site https://www.pro-football-reference.com has all the data you can imagine in the form of tables, which can be exported to CSV format on the site and manually copied/pasted. I started doing this and then, using the Pandas library, began reading the CSVs into DataFrames to make use of the data.

This works great; however, manually fetching all this data is quite tedious, so I decided to attempt to create a web scraper that can scrape HTML tables and convert them into a usable CSV format. I am struggling with two things specifically: isolating individual tables, and getting the CSV that is produced to render in a readable/usable format.

Here is what the scraper looks like right now:

from bs4 import BeautifulSoup
import requests
import csv

def table_Scrape():
    url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    table = soup.select_one('table.stats_table')
    headers = [th.text.encode("utf-8") for th in table.select("tr th")]
    with open("out.csv", "w", encoding='utf-8') as f:
        wr = csv.writer(f)
        wr.writerow(headers)
        wr.writerows([
            [td.text.encode("utf-8") for td in row.find_all("td")]
            for row in table.select("tr + tr")
        ])    
table_Scrape()

This does properly send the request to the URL, but it doesn't fetch the data I am looking for, which is 'Rushing_and_Receiving'. Instead, it fetches the first table on the page, 'Team Stats and Ranking'. It also renders the CSV in a rather ugly/not useful format, like so:

b'',b'',b'',b'Tot Yds & TO',b'',b'',b'Passing',b'Rushing',b'Penalties',b'',b'Average Drive',b'Player',b'PF',b'Yds',b'Ply',b'Y/P',b'TO',b'FL',b'1stD',b'Cmp',b'Att',b'Yds',b'TD',b'Int',b'NY/A',b'1stD',b'Att',b'Yds',b'TD',b'Y/A',b'1stD',b'Pen',b'Yds',b'1stPy',b'#Dr',b'Sc%',b'TO%',b'Start',b'Time',b'Plays',b'Yds',b'Pts',b'Team Stats',b'Opp. Stats',b'Lg Rank Offense',b'Lg Rank Defense'

b'309',b'4944',b'920',b'5.4',b'22',b'8',b'268',b'288',b'474',b'3222',b'27',b'14',b'6.4',b'176',b'415',b'1722',b'8',b'4.1',b'78',b'81',b'636',b'14',b'170',b'30.6',b'12.9',b'Own 27.8',b'2:38',b'5.5',b'29.1',b'1.74'
b'8',b'5',b'',b'',b'8',b'13',b'1',b'',b'12',b'12',b'13',b'5',b'13',b'',b'4',b'6',b'4',b'7',b'',b'',b'',b'',b'',b'1',b'21',b'2',b'3',b'2',b'5',b'4'
b'8',b'10',b'',b'',b'20',b'20',b'7',b'',b'7',b'11',b'31',b'15',b'21',b'',b'11',b'15',b'4',b'15',b'',b'',b'',b'',b'',b'24',b'16',b'5',b'13',b'14',b'15',b'11'
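A quick experiment suggests the b'' clutter comes from the .encode("utf-8") calls: csv.writer expects str values and converts anything else with str(), so a bytes object like b'PF' is written literally as "b'PF'". A minimal sketch (using an in-memory buffer instead of a file):

```python
import csv
import io

# Writing bytes: csv.writer stringifies them, producing the b'' noise.
buf = io.StringIO()
csv.writer(buf).writerow(['PF'.encode('utf-8'), 'Yds'.encode('utf-8')])
print(buf.getvalue())   # b'PF',b'Yds'

# Writing plain str values (no .encode) produces normal CSV.
buf = io.StringIO()
csv.writer(buf).writerow(['PF', 'Yds'])
print(buf.getvalue())   # PF,Yds
```

Since the file is already opened with encoding='utf-8', the .encode calls appear to be unnecessary.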

I know my issue with fetching the correct table lies within the line:

table = soup.select_one('table.stats_table')

I would still consider myself a novice in Python, so if someone can help me query and parse a specific table with BS4 into CSV format, I would be beyond appreciative!

Thanks in advance!

The pandas solution didn't work for me due to the AJAX load, but you can see in the browser console the URL each table loads from, and request it directly. In this case, the URL is: https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr&url=%2Fteams%2Fnwe%2F2008.htm&div=div_rushing_and_receiving
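That widget URL can be rebuilt for any table by URL-encoding the page path and swapping the div parameter (the helper name here is my own; other div ids are assumptions you would read from the console):

```python
from urllib.parse import quote

def widget_url(page_path, div_id):
    # Build the sports-reference widget URL that serves one table of a page.
    return ('https://widgets.sports-reference.com/wg.fcgi'
            '?css=1&site=pfr'
            f'&url={quote(page_path, safe="")}'
            f'&div={div_id}')

print(widget_url('/teams/nwe/2008.htm', 'div_rushing_and_receiving'))
```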

You can then get the table directly using its id, rushing_and_receiving.

This seems to work.

from bs4 import BeautifulSoup
import requests
import csv

def table_Scrape():
    # Request the widget URL that serves just the Rushing and Receiving table
    url = 'https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr&url=%2Fteams%2Fnwe%2F2008.htm&div=div_rushing_and_receiving'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    table = soup.find('table', id='rushing_and_receiving')
    # The second header row holds the real column names;
    # the first row is just the spanning "Rushing"/"Receiving" group labels
    headers = [th.text for th in table.find_all('tr')[1].find_all('th')]
    body = table.find('tbody')
    # newline='' prevents csv from inserting blank lines on Windows
    with open('out.csv', 'w', encoding='utf-8', newline='') as f:
        wr = csv.writer(f)
        wr.writerow(headers)
        for data_row in body.find_all('tr'):
            # Each body row starts with a <th> (rank) followed by <td> stats
            th = data_row.find('th')
            wr.writerow([th.text] + [td.text for td in data_row.find_all('td')])

table_Scrape()

I would bypass BeautifulSoup altogether, since pandas works well for this site (at least for the first four tables I glanced over). See the read_html documentation.

import pandas as pd

url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
data = pd.read_html(url)
# data is now a list of dataframes (spreadsheets) one dataframe for each table in the page
data[0].to_csv('somefile.csv')
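If you only want one table, read_html can also filter by a table's HTML attributes via its attrs parameter (a sketch using a toy page, since the ids and values here are made up for illustration):

```python
from io import StringIO

import pandas as pd

# A toy page with two tables, mimicking one table per stat category.
html = """
<table id="team_stats"><tr><th>PF</th></tr><tr><td>309</td></tr></table>
<table id="rushing_and_receiving">
  <tr><th>Player</th><th>Yds</th></tr>
  <tr><td>Faulk</td><td>507</td></tr>
</table>
"""

# attrs matches on the table's HTML attributes, so only the
# rushing_and_receiving table is parsed into a DataFrame.
df = pd.read_html(StringIO(html), attrs={'id': 'rushing_and_receiving'})[0]
print(df)
```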

I wish I could credit both of these answers as correct, as they are both useful, but the answer using BeautifulSoup is the better one, since it allows for the isolation of specific tables, whereas the way the site is structured limits the effectiveness of pandas' read_html method.

Thanks to everyone who responded!
