Is there a way to filter or remove data from Beautifulsoup?

I was attempting to web scrape a site with information about games and their schedules. Initially, I had success in importing all the relevant data into my program; however, once the games began playing this changed. The website removed the "time" column from its display, which resulted in an uneven number of columns being imported into my program, one fewer than before since there was no "time" column anymore. This caused problems: when I tried to construct a dataframe out of the collected information, it would not work properly due to an unequal number of entries within each row. I would like to import only those games yet to be played.

import requests
from bs4 import BeautifulSoup

link = "https://www.espn.com/nfl/schedule/_/week/1/year/2022/seasontype/3"
page = requests.get(link)
soup = BeautifulSoup(page.content,"html.parser")

nfl_resp = soup.find_all('div', class_='ResponsiveTable')

nfl_list = []
nfl_time_list = []
nfl_location_list = []
nfl_date_list = []
visit_list = []

for i in nfl_resp:
    location = i.find_all(class_='location__col Table__TD')
    for team in location:
        nfl_location_list.append(team.text)

#I get all the correct stadiums 

for i in nfl_resp:
    time = i.find_all(class_='date__col Table__TD')
    for hour in time:
        nfl_time_list.append(hour.text)

#I get all the correct times

for i in nfl_resp:
    dates = i.find_all(class_='Table__Title')
    for day in dates:
        nfl_date_list.append(day.text.strip())

#I get all dates correctly

for i in nfl_resp:
    visit = i.find_all(class_="events__col Table__TD")
    for team in visit:
        visit_list.append(team.text)

#Here's the problem: I get all the games regardless of whether they started or not.
#It only works if the games are yet to start; I need to run it when games are in progress or over too.
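
One way to sidestep the misalignment is to parse row by row and skip any row that no longer has the full set of cells. Here is a minimal sketch (the Table__TR row class comes from ESPN's markup, as the answer below also uses; the cell count of 7 for a not-yet-played game is an assumption):

import requests
from bs4 import BeautifulSoup

link = "https://www.espn.com/nfl/schedule/_/week/1/year/2022/seasontype/3"
soup = BeautifulSoup(requests.get(link).content, "html.parser")

upcoming = []
for table in soup.find_all('div', class_='ResponsiveTable'):
    for row in table.find_all('tr', class_='Table__TR'):
        cells = row.find_all('td')
        # Completed games lose their "time" cell and come up one column
        # short, so keeping only full-length rows filters them out.
        if len(cells) == 7:  # assumed cell count for a game not yet played
            upcoming.append([c.get_text(strip=True) for c in cells])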

You can use this example that parses various information from the ESPN site:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.espn.com/nfl/schedule/_/week/1/year/2022/seasontype/3"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for row in soup.select(".Table__TR:has(.AnchorLink)"):
    data = [t.text for t in row.select(".AnchorLink:not(:has(img))")]
    networks = [
        n["alt"] if n.name == "img" else n.text
        for n in row.select(".network-container img, .network-container .network-name")
    ]
    date = row.find_previous(class_="Table__Title").text.strip()
    all_data.append([*data, networks, date])

df = pd.DataFrame(
    all_data,
    columns=["Team 1", "Team 2", "Time", "Tickets", "Stadium", "Networks", "Date"],
)
print(df)

Prints:

        Team 1         Team 2     Time                 Tickets                             Stadium            Networks                        Date
0      Seattle  San Francisco  4:30 PM  Tickets as low as $138     Levi's Stadium, Santa Clara, CA               [FOX]  Saturday, January 14, 2023
1  Los Angeles   Jacksonville  8:15 PM  Tickets as low as $138   TIAA Bank Field, Jacksonville, FL               [NBC]  Saturday, January 14, 2023
2        Miami        Buffalo  1:00 PM  Tickets as low as $114  Highmark Stadium, Orchard Park, NY               [CBS]    Sunday, January 15, 2023
3     New York      Minnesota  4:30 PM  Tickets as low as $116  U.S. Bank Stadium, Minneapolis, MN               [FOX]    Sunday, January 15, 2023
4    Baltimore     Cincinnati  8:15 PM  Tickets as low as $171      Paycor Stadium, Cincinnati, OH               [NBC]    Sunday, January 15, 2023
5       Dallas      Tampa Bay  8:15 PM  Tickets as low as $163    Raymond James Stadium, Tampa, FL  [ESPN, ABC, ESPN+]    Monday, January 16, 2023
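
Because each row is parsed as a unit here, a missing time cell only affects that single row instead of shifting every later entry, which is what broke the one-list-per-column approach in the question. To keep only the games yet to be played, you could then filter the DataFrame on the Time column; finished games show a result rather than a kickoff time, so a time-of-day pattern works (the exact pattern is an assumption about ESPN's formatting):

# Keep only rows whose "Time" cell looks like a kickoff time such as "4:30 PM".
upcoming_df = df[df["Time"].str.match(r"\d{1,2}:\d{2} [AP]M", na=False)]
print(upcoming_df)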
