How to scrape websites with Python and Beautiful Soup

I am trying to scrape results from the BBC Sport website. The scores work, but when I try to add the team names the program prints None 1-0 None (for example). This is the code:

from bs4 import BeautifulSoup
import urllib.request
import csv 

url = 'http://www.bbc.co.uk/sport/football/teams/derby-county/results'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page)
for match in soup.select('table.table-stats tr.report'):
    team1 = match.find('span', class_='team-home')
    team2 = match.find('span', class_='team-away')
    score = match.abbr

    print(team1.string, score.string, team2.string)

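The spans in the HTML carry class="team-home teams", not just class="team-home". Since your program prints None instead of raising an AttributeError, the spans are most likely being found, and it is .string that returns None, which it does whenever a tag contains more than one child. Using .text on the span prints the first team name: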

tables = soup.find_all("table", class_="table-stats")

tables[0].find("span", class_="team-home teams").text
# u' Birmingham '
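
To make the difference between .string and .text concrete, here is a minimal sketch. The HTML fragment is invented for illustration and only borrows the class names quoted above, so the real BBC markup may differ. It relies on two documented Beautiful Soup behaviours: class_ matches a tag when any one of its classes equals the value, and .string is None for a tag with more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only the class names come from the question,
# the structure and team names are assumptions for this example.
html = """
<table class="table-stats">
  <tr class="report">
    <td>
      <span class="team-home teams"> <a href="#">Derby</a> </span>
      <span class="score"><abbr title="Score">1-0</abbr></span>
      <span class="team-away teams"> <a href="#">Birmingham</a> </span>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.select("table.table-stats tr.report")[0]

# class_="team-home" matches because "team-home" is one of the span's classes
home = row.find("span", class_="team-home")
away = row.find("span", class_="team-away")
score = row.abbr

# Each team span holds whitespace, an <a> tag, and more whitespace,
# so .string cannot pick a single child and returns None
print(home.string, score.string, away.string)  # None 1-0 None

# get_text() joins the text of all descendants; strip=True trims whitespace
home_name = home.get_text(strip=True)
away_name = away.get_text(strip=True)
print(home_name, score.string, away_name)  # Derby 1-0 Birmingham
```

If the real page nests the team name in a link like this, switching from .string to .text (or get_text) in the original loop should be enough.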

Here is a possible solution that gets the home and away team names, the final score, the match date and the competition name with BeautifulSoup and puts them in a pandas DataFrame.

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Fetch the results page and parse it with the lxml parser
url = "http://www.bbc.co.uk/sport/football/teams/derby-county/results"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

# Collect the text of every tag with the given class into a one-column DataFrame
def get_match_info(soup, tag, class_name, column_name):
    info_array = []
    for info in soup.find_all(tag, attrs={'class': class_name}):
        info_array.append({column_name: info.text})
    return pd.DataFrame(info_array)

# For each category, pass the function the relevant tag name, class and column label
date        = get_match_info(soup, "td", "match-date", "Date")
home_team   = get_match_info(soup, "span", "team-home teams", "Home Team")
score       = get_match_info(soup, "span", "score", "Score")
away_team   = get_match_info(soup, "span", "team-away teams", "Away Team")
competition = get_match_info(soup, "td", "match-competition", "Competition")

# Concatenate the DataFrames side by side into a final table of all the above info
match_info = pd.concat([date, home_team, score, away_team, competition], ignore_index=False, axis=1)

print(match_info)
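
One thing to watch with the column-by-column approach is that the columns are lined up purely by position, so a row that lacks one of the fields would shift everything after it. A row-by-row variant avoids that. The sketch below is only an illustration: the HTML, the FIELDS mapping and the parse_matches helper are all hypothetical, and only the class names are taken from the code above.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the second row deliberately has no competition cell.
html = """
<table class="table-stats">
  <tr class="report">
    <td class="match-competition">Championship</td>
    <td class="match-date">Sat 3 Jan</td>
    <td>
      <span class="team-home teams"><a href="#">Derby</a></span>
      <span class="score"><abbr title="Score">1-0</abbr></span>
      <span class="team-away teams"><a href="#">Birmingham</a></span>
    </td>
  </tr>
  <tr class="report">
    <td class="match-date">Tue 6 Jan</td>
    <td>
      <span class="team-home teams"><a href="#">Team C</a></span>
      <span class="score"><abbr title="Score">2-2</abbr></span>
      <span class="team-away teams"><a href="#">Derby</a></span>
    </td>
  </tr>
</table>
"""

# Column label -> (tag name, one CSS class expected on that tag)
FIELDS = {
    "Date": ("td", "match-date"),
    "Home Team": ("span", "team-home"),
    "Score": ("span", "score"),
    "Away Team": ("span", "team-away"),
    "Competition": ("td", "match-competition"),
}

def parse_matches(soup):
    """Return one dict per result row, with None for any missing field."""
    rows = []
    for tr in soup.select("table.table-stats tr.report"):
        record = {}
        for column, (tag, css_class) in FIELDS.items():
            cell = tr.find(tag, class_=css_class)
            record[column] = cell.get_text(strip=True) if cell else None
        rows.append(record)
    return rows

matches = parse_matches(BeautifulSoup(html, "html.parser"))
for match in matches:
    print(match)
```

The resulting list of dicts can be passed straight to pd.DataFrame(matches) to get the same kind of table as above, with missing values kept in the right row.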

