
Scraping and Looping Meta Tags With Beautiful Soup

Below is a web scraper that uses Beautiful Soup to scrape a team roster from this website. Each column of data is put into a list, which is then looped through and written to a CSV file. I would like to scrape the team name ('team' in the code), but I'm struggling to incorporate the meta tag (see the HTML below) into my CSV writerow loop.

<meta property="og:site_name" content="Tampa Bay Rays" />

I believe the issue is that the number of values in the 'Team' array does not match the number of values in the other columns. For example, my current code prints arrays that look like this:

[Player A, Player B, Player C]
[46,36,33]
[Tampa Bay Rays]

But I need the team array (last array) to match the length of the first two arrays like this:

[Player A, Player B, Player C]
[46,36,33]
[Tampa Bay Rays, Tampa Bay Rays, Tampa Bay Rays]

Would anyone know how to make this meta tag adjustment within my writerow csv loop? Thanks in advance!

import requests
import csv
from bs4 import BeautifulSoup

page=requests.get('http://m.rays.mlb.com/roster/')
soup=BeautifulSoup(page.text, 'html.parser')

#Remove Unwanted Links
last_links=soup.find(class_='nav-tabset-container')
last_links.decompose()
side_links=soup.find(class_='column secondary span-5 right')
side_links.decompose()

#Generate CSV
f=csv.writer(open('MLB_Active_Roster.csv','w',newline=''))
f.writerow(['Name','Number','Hand','Height','Weight','DOB','Team'])

#Find Player Name Links
player_list=soup.find(class_='layout layout-roster')
player_list_items=player_list.find_all('a')

#Extract Player Name Text
names=[player_name.contents[0] for player_name in player_list_items]

#Find Player Number
number_list=soup.find(class_='layout layout-roster')
number_list_items=number_list.find_all('td',index='0')


#Extract Player Number Text
number=[player_number.contents[0] for player_number in number_list_items]

#Find B/T
hand_list=soup.find(class_='layout layout-roster')
hand_list_items=hand_list.find_all('td',index='3')

#Extract B/T
handedness=[player_hand.contents[0] for player_hand in hand_list_items]

#Find Height
height_list=soup.find(class_='layout layout-roster')
height_list_items=height_list.find_all('td',index='4')

#Extract Height
height=[player_height.contents[0] for player_height in height_list_items]

#Find Weight
weight_list=soup.find(class_='layout layout-roster')
weight_list_items=weight_list.find_all('td',index='5')

#Extract Weight
weight=[player_weight.contents[0] for player_weight in weight_list_items]

#Find DOB
DOB_list=soup.find(class_='layout layout-roster')
DOB_list_items=DOB_list.find_all('td',index='6')

#Extract DOB
DOB=[player_DOB.contents[0] for player_DOB in DOB_list_items]

#Find Team Name
team_list=soup.find('meta',property='og:site_name')
Team=[team_name.contents[0] for team_name in team_list]
print(Team)

#Loop Excel Rows
for i in range(len(names)):
    f.writerow([names[i],number[i],handedness[i],height[i],weight[i],DOB[i],Team[i]])

One problem is in the way you are passing attribute filters to the find function.

The keyword form works:

player_list=soup.find(class_='layout layout-roster')

but if you prefer a dict, it must be passed to the attrs parameter rather than as the first positional argument (the first positional argument is the tag name):

player_list=soup.find(attrs={"class":"layout layout-roster"})

(apply this change to all the find calls that use a dict)
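As a quick self-contained check (the HTML snippet here is made up for illustration), the class_ keyword form and the attrs dict form select the same element:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the roster container
html = '<div class="layout layout-roster"><a>Player A</a></div>'
soup = BeautifulSoup(html, 'html.parser')

by_keyword = soup.find(class_='layout layout-roster')
by_attrs = soup.find(attrs={'class': 'layout layout-roster'})

print(by_keyword is by_attrs)  # -> True: both return the same Tag object
```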


Your end script should look like this:

import requests
import csv
from bs4 import BeautifulSoup

page=requests.get('http://m.rays.mlb.com/roster/')
soup=BeautifulSoup(page.text, 'html.parser')

#Remove Unwanted Links
last_links=soup.find(attrs={"class":'nav-tabset-container'})
last_links.decompose()
side_links=soup.find(attrs={"class":'column secondary span-5 right'})
side_links.decompose()

#Generate CSV
f=csv.writer(open('MLB_Active_Roster.csv','w',newline=''))
f.writerow(['Name','Number','Hand','Height','Weight','DOB','Team'])

#Find Player Name Links
player_list=soup.find(attrs={"class":'layout layout-roster'})
player_list_items=player_list.find_all('a')

#Extract Player Name Text
names=[player_name.contents[0] for player_name in player_list_items]

#Find Player Number
number_list=soup.find(attrs={"class":'layout layout-roster'})
number_list_items=number_list.find_all('td',{"index":'0'})


#Extract Player Number Text
number=[player_number.contents[0] for player_number in number_list_items]

#Find B/T
hand_list=soup.find(attrs={"class":'layout layout-roster'})
hand_list_items=hand_list.find_all('td',{"index":'3'})

#Extract B/T
handedness=[player_hand.contents[0] for player_hand in hand_list_items]

#Find Height
height_list=soup.find(attrs={"class":'layout layout-roster'})
height_list_items=height_list.find_all('td',{"index":'4'})

#Extract Height
height=[player_height.contents[0] for player_height in height_list_items]

#Find Weight
weight_list=soup.find(attrs={"class":'layout layout-roster'})
weight_list_items=weight_list.find_all('td',{"index":'5'})

#Extract Weight
weight=[player_weight.contents[0] for player_weight in weight_list_items]

#Find DOB
DOB_list=soup.find(attrs={"class":'layout layout-roster'})
DOB_list_items=DOB_list.find_all('td',{"index":'6'})

#Extract DOB
DOB=[player_DOB.contents[0] for player_DOB in DOB_list_items]

#Find Team Name
team_list=soup.find('meta',{"property":'og:site_name'})
Team=[team_list['content'] for _ in names]
print(Team)

#Loop Excel Rows
for i in range(len(names)):
    f.writerow([names[i],number[i],handedness[i],height[i],weight[i],DOB[i],Team[i]])

The change is simple: replace the #Find Team Name part with:

#Find Team Name
team_list=soup.find('meta',property='og:site_name')
Team = [team_list['content'] for _ in names]
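To see why the original comprehension came back empty, here is a minimal sketch on made-up HTML: a meta tag has no children, so iterating over it yields nothing, while the team name lives in its content attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical document containing only the relevant meta tag
html = '<head><meta property="og:site_name" content="Tampa Bay Rays" /></head>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('meta', property='og:site_name')
print(list(tag))       # [] -- a <meta> tag has no children to iterate
print(tag['content'])  # Tampa Bay Rays

# Repeat the attribute once per player so the column lengths match
names = ['Player A', 'Player B', 'Player C']
print([tag['content'] for _ in names])
```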

Complete program:

import requests
import csv
from bs4 import BeautifulSoup

page=requests.get('http://m.rays.mlb.com/roster/')
soup=BeautifulSoup(page.text, 'html.parser')

#Remove Unwanted Links
last_links=soup.find(class_='nav-tabset-container')
last_links.decompose()
side_links=soup.find(class_='column secondary span-5 right')
side_links.decompose()

#Generate CSV
f=csv.writer(open('MLB_Active_Roster.csv','w',newline=''))
f.writerow(['Name','Number','Hand','Height','Weight','DOB','Team'])

#Find Player Name Links
player_list=soup.find(class_='layout layout-roster')
player_list_items=player_list.find_all('a')

#Extract Player Name Text
names=[player_name.contents[0] for player_name in player_list_items]

#Find Player Number
number_list=soup.find(class_='layout layout-roster')
number_list_items=number_list.find_all('td',index='0')


#Extract Player Number Text
number=[player_number.contents[0] for player_number in number_list_items]

#Find B/T
hand_list=soup.find(class_='layout layout-roster')
hand_list_items=hand_list.find_all('td',index='3')

#Extract B/T
handedness=[player_hand.contents[0] for player_hand in hand_list_items]

#Find Height
height_list=soup.find(class_='layout layout-roster')
height_list_items=height_list.find_all('td',index='4')

#Extract Height
height=[player_height.contents[0] for player_height in height_list_items]

#Find Weight
weight_list=soup.find(class_='layout layout-roster')
weight_list_items=weight_list.find_all('td',index='5')

#Extract Weight
weight=[player_weight.contents[0] for player_weight in weight_list_items]

#Find DOB
DOB_list=soup.find(class_='layout layout-roster')
DOB_list_items=DOB_list.find_all('td',index='6')

#Extract DOB
DOB=[player_DOB.contents[0] for player_DOB in DOB_list_items]

#Find Team Name
team_list=soup.find('meta',property='og:site_name')
Team = [team_list['content'] for _ in names]

#Loop Excel Rows
for i in range(len(names)):
    f.writerow([names[i],number[i],handedness[i],height[i],weight[i],DOB[i],Team[i]])

The resulting CSV file:

Name,Number,Hand,Height,Weight,DOB,Team
Jose Alvarado,46,L/L,"6'2""",245lbs,5/21/95,Tampa Bay Rays
Matt Andriese,35,R/R,"6'2""",225lbs,8/28/89,Tampa Bay Rays
Chris Archer,22,R/R,"6'2""",195lbs,9/26/88,Tampa Bay Rays
Diego Castillo,63,R/R,"6'3""",240lbs,1/18/94,Tampa Bay Rays
Nathan Eovaldi,24,R/R,"6'2""",225lbs,2/13/90,Tampa Bay Rays
Chih-Wei Hu,58,R/R,"6'0""",220lbs,11/4/93,Tampa Bay Rays
Andrew Kittredge,36,R/R,"6'1""",200lbs,3/17/90,Tampa Bay Rays
Adam Kolarek,56,L/L,"6'3""",205lbs,1/14/89,Tampa Bay Rays
Sergio Romo,54,R/R,"5'11""",185lbs,3/4/83,Tampa Bay Rays
Jaime Schultz,57,R/R,"5'10""",200lbs,6/20/91,Tampa Bay Rays
Blake Snell,4,L/L,"6'4""",200lbs,12/4/92,Tampa Bay Rays
Ryne Stanek,55,R/R,"6'4""",215lbs,7/26/91,Tampa Bay Rays
Hunter Wood,61,R/R,"6'1""",165lbs,8/12/93,Tampa Bay Rays
Ryan Yarbrough,48,R/L,"6'5""",205lbs,12/31/91,Tampa Bay Rays
Wilson Ramos,40,R/R,"6'1""",245lbs,8/10/87,Tampa Bay Rays
Jesus Sucre,45,R/R,"6'0""",200lbs,4/30/88,Tampa Bay Rays
Jake Bauers,9,L/L,"6'1""",195lbs,10/6/95,Tampa Bay Rays
Ji-Man Choi,26,L/R,"6'1""",230lbs,5/19/91,Tampa Bay Rays
C.J. Cron,44,R/R,"6'4""",235lbs,1/5/90,Tampa Bay Rays
Matt Duffy,5,R/R,"6'2""",170lbs,1/15/91,Tampa Bay Rays
Adeiny Hechavarria,11,R/R,"6'0""",195lbs,4/15/89,Tampa Bay Rays
Daniel Robertson,28,R/R,"5'11""",200lbs,3/22/94,Tampa Bay Rays
Joey Wendle,18,L/R,"6'1""",190lbs,4/26/90,Tampa Bay Rays
Carlos Gomez,27,R/R,"6'3""",220lbs,12/4/85,Tampa Bay Rays
Kevin Kiermaier,39,L/R,"6'1""",215lbs,4/22/90,Tampa Bay Rays
Mallex Smith,0,L/R,"5'10""",180lbs,5/6/93,Tampa Bay Rays

There is so much duplication in your code. Try to avoid copy-and-paste programming.

That being said, you can build a list by repeating the same item: ['foo'] * 3 gives ['foo', 'foo', 'foo']. This is handy for the team name, which is the same for every team member.

You can use zip() and writerows() to write all lists to the CSV in one line of code.
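Both ideas can be sketched with made-up data (no scraping needed) and an in-memory buffer:

```python
import csv
import io

names = ['Player A', 'Player B', 'Player C']
number = ['46', '36', '33']
team = ['Tampa Bay Rays'] * len(names)  # one copy per player

buf = io.StringIO()
w = csv.writer(buf)
w.writerow(['Name', 'Number', 'Team'])
w.writerows(zip(names, number, team))  # one CSV row per zipped tuple

print(buf.getvalue())
```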

import requests
import csv
from bs4 import BeautifulSoup

page = requests.get('http://m.rays.mlb.com/roster/')
soup = BeautifulSoup(page.text, 'html.parser')

soup.find(class_='nav-tabset-container').decompose()
soup.find(class_='column secondary span-5 right').decompose()

roster = soup.find(class_='layout layout-roster')
names = [n.contents[0] for n in roster.find_all('a')]
number = [n.contents[0] for n in roster.find_all('td', index='0')]
handedness = [n.contents[0] for n in roster.find_all('td', index='3')]
height = [n.contents[0] for n in roster.find_all('td', index='4')]
weight = [n.contents[0] for n in roster.find_all('td', index='5')]
DOB = [n.contents[0] for n in roster.find_all('td', index='6')]
team = [soup.find('meta',property='og:site_name')['content']] * len(names)

with open('MLB_Active_Roster.csv', 'w', newline='') as fp:
    f = csv.writer(fp)
    f.writerow(['Name','Number','Hand','Height','Weight','DOB','Team'])
    f.writerows(zip(names, number, handedness, height, weight, DOB, team))
