
Scraping HTML tables to CSVs using BS4 for use with Pandas

I have begun a pet project creating what is essentially an indexed compilation of a plethora of NFL statistics with a nice, simple GUI. Fortunately, the site https://www.pro-football-reference.com has all the data you can imagine in the form of tables, which can be exported to CSV format on the site and manually copied/pasted. I started doing this, and then, using the Pandas library, began reading the CSVs into DataFrames to make use of the data.

This works great; however, manually fetching all this data is quite tedious, so I decided to attempt to create a web scraper that can scrape HTML tables and convert them into a usable CSV format. I am struggling, specifically with isolating individual tables, but also with getting the CSV that is produced to render in a readable/usable format.

Here is what the scraper looks like right now:

from bs4 import BeautifulSoup
import requests
import csv

def table_Scrape():
    url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    table = soup.select_one('table.stats_table')
    headers = [th.text.encode("utf-8") for th in table.select("tr th")]
    with open("out.csv", "w", encoding='utf-8') as f:
        wr = csv.writer(f)
        wr.writerow(headers)
        wr.writerows([
            [td.text.encode("utf-8") for td in row.find_all("td")]
            for row in table.select("tr + tr")
        ])

table_Scrape()

This does properly send the request to the URL, but it doesn't fetch the data I am looking for, which is 'Rushing_and_Receiving'. Instead, it fetches the first table on the page, 'Team Stats and Ranking'. It also renders the CSV in a rather ugly/not useful format, like so:

b'',b'',b'',b'Tot Yds & TO',b'',b'',b'Passing',b'Rushing',b'Penalties',b'',b'Average Drive',b'Player',b'PF',b'Yds',b'Ply',b'Y/P',b'TO',b'FL',b'1stD',b'Cmp',b'Att',b'Yds',b'TD',b'Int',b'NY/A',b'1stD',b'Att',b'Yds',b'TD',b'Y/A',b'1stD',b'Pen',b'Yds',b'1stPy',b'#Dr',b'Sc%',b'TO%',b'Start',b'Time',b'Plays',b'Yds',b'Pts',b'Team Stats',b'Opp. Stats',b'Lg Rank Offense',b'Lg Rank Defense'

b'309',b'4944',b'920',b'5.4',b'22',b'8',b'268',b'288',b'474',b'3222',b'27',b'14',b'6.4',b'176',b'415',b'1722',b'8',b'4.1',b'78',b'81',b'636',b'14',b'170',b'30.6',b'12.9',b'Own 27.8',b'2:38',b'5.5',b'29.1',b'1.74'
b'8',b'5',b'',b'',b'8',b'13',b'1',b'',b'12',b'12',b'13',b'5',b'13',b'',b'4',b'6',b'4',b'7',b'',b'',b'',b'',b'',b'1',b'21',b'2',b'3',b'2',b'5',b'4'
b'8',b'10',b'',b'',b'20',b'20',b'7',b'',b'7',b'11',b'31',b'15',b'21',b'',b'11',b'15',b'4',b'15',b'',b'',b'',b'',b'',b'24',b'16',b'5',b'13',b'14',b'15',b'11'

I know my issue with fetching the correct table lies within the line:

table = soup.select_one('table.stats_table')

I am what I would still consider a novice in Python, so if someone can help me query and parse a specific table with BS4 into CSV format, I would be beyond appreciative!

Thanks in advance!

The pandas solution didn't work for me due to the ajax load, but you can see in the console the URL each table is loading from, and request it directly. In this case, the URL is: https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr&url=%2Fteams%2Fnwe%2F2008.htm&div=div_rushing_and_receiving

You can then get the table directly using its id rushing_and_receiving.
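
Assuming the same widget pattern holds for the site's other pages (only the URL above is confirmed), you could build the endpoint URL from the page path and the table's div id with a small helper; widget_url below is a hypothetical name:

from urllib.parse import quote

def widget_url(page_path, div_id):
    # hypothetical helper, e.g. page_path='/teams/nwe/2008.htm',
    # div_id='div_rushing_and_receiving'
    return ('https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr'
            '&url=' + quote(page_path, safe='') + '&div=' + div_id)

# reproduces the URL above
print(widget_url('/teams/nwe/2008.htm', 'div_rushing_and_receiving'))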

This seems to work.

from bs4 import BeautifulSoup
import requests
import csv

def table_Scrape():
    url = 'https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr&url=%2Fteams%2Fnwe%2F2008.htm&div=div_rushing_and_receiving'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    table = soup.find('table', id='rushing_and_receiving')
    # the first header row only carries the 'Rushing'/'Receiving' group labels;
    # the real column names are the <th> cells of the second row
    headers = [th.text for th in table.find_all("tr")[1].find_all("th")]
    body = table.find('tbody')
    # newline='' avoids blank lines between rows on Windows
    with open("out.csv", "w", encoding='utf-8', newline='') as f:
        wr = csv.writer(f)
        wr.writerow(headers)
        for data_row in body.find_all("tr"):
            # each data row keeps the player name in a <th> and the stats in <td>s
            th = data_row.find('th')
            wr.writerow([th.text] + [td.text for td in data_row.find_all("td")])

table_Scrape()
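
Note that this also fixes the b'...' prefixes in the question's output: the question's code encoded each cell to bytes, and csv.writer stringifies non-str cells with str(), so a bytes cell is written as its repr. Writing th.text / td.text (plain str) avoids that. A minimal demonstration:

import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['Yds'.encode('utf-8'), 'Yds'])
print(buf.getvalue())  # b'Yds',Yds  <- the bytes cell comes out as its repr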

I would bypass Beautiful Soup altogether, since pandas works well for this site (at least for the first 4 tables I glossed over). Documentation here.

import pandas as pd

url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
data = pd.read_html(url)
# data is now a list of dataframes (spreadsheets) one dataframe for each table in the page
data[0].to_csv('somefile.csv')
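
If you need something other than the first table, read_html also takes a match argument (a string or regex; only tables containing matching text are returned). A sketch, with the caveat that tables loaded via ajax on this site won't appear in the list at all, as the other answer notes:

# 'match' filters the returned list to tables whose text matches;
# whether it isolates exactly the intended table depends on which
# tables are present in the static HTML
rushing = pd.read_html(url, match='Rushing')
rushing[0].to_csv('rushing_and_receiving.csv', index=False)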

I wish I could credit both of these answers as correct, as they are both useful, but alas, the second answer using BeautifulSoup is the better answer, since it allows for the isolation of specific tables, whereas the way the site is structured limits the effectiveness of the 'read_html' method in Pandas.

Thanks to everyone who responded!
