How to remove characters in pandas data frame when web scraping with Python?

I am trying to scrape a table from this website into a .csv file using Python 3: 2011-2012 NBA National Schedule

The table starts out like:

                Revised Schedule                    Original Schedule

Date            Time      Game                Net   Time      Game                  Net
Sun., 12/25/11  12 PM     BOS (1) at NY (1)   TNT   12 PM     BOS (7) at NY (7)     ESPN
Sun., 12/25/11  2:30 PM   MIA (1) at DAL (1)  ABC   2:30 PM   MIA (8) at DAL (5)    ABC
Sun., 12/25/11  5 PM      CHI (1) at LAL (1)  ABC   5 PM      CHI (6) at LAL (9)    ABC
Sun., 12/25/11  8 PM      ORL (1) at OKC (1)  ESPN  no game   no game               no game
Sun., 12/25/11  10:30 PM  LAC (1) at GS (1)   ESPN  no game   no game               no game
Tue., 12/27/11  8 PM      BOS (2) at MIA (2)  TNT   no game   no game               no game
Tue., 12/27/11  10:30 PM  UTA (1) at LAL (2)  TNT   no game   no game               no game

I am only interested in the revised schedule, which is the first 4 columns. The output I want in a .csv file looks like this:

(image: output in the .csv file)

I am using these packages:

import re
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from itertools import groupby

This is the code I have written to match the format I want:

df = pd.read_html("https://www.sportsmediawatch.com/2011/12/revised-2011-12-nba-national-tv-schedule/", header=0)[0]

revisedCols = ['Date'] + [ col for col in df.columns if 'Revised' in col ]
df = df[revisedCols]

df.columns = df.iloc[0,:]

df = df.iloc[1:,:].reset_index(drop=True)


# Format Date to m/d/y
df['Date'] = np.where(df.Date.str.startswith(('10/', '11/', '12/')), df.Date + ' 11', df.Date + ' 12')
df['Date']=pd.to_datetime(df['Date'])
df['Date']=df['Date'].dt.strftime('%m/%d/%Y')

# Split the Game column
df[['Away','Home']] = df.Game.str.split('at',expand=True)   


# Final dataframe with desired columns
df = df[['Date','Time','Away','Home','Net']]

df.columns = ['Date', 'Time', 'Away', 'Home', 'Network']

print(df)

Output:

           Date      Time      Away        Home Network
0    12/25/2011     12 PM   BOS (1)      NY (1)     TNT
1    12/25/2011   2:30 PM   MIA (1)     DAL (1)     ABC
2    12/25/2011      5 PM   CHI (1)     LAL (1)     ABC
3    12/25/2011      8 PM   ORL (1)     OKC (1)    ESPN
4    12/25/2011  10:30 PM   LAC (1)      GS (1)    ESPN
5    12/27/2011      8 PM   BOS (2)     MIA (2)     TNT
6    12/27/2011  10:30 PM   UTA (1)     LAL (2)     TNT

I noticed there is a (1), (2), etc. next to each team name in the Away and Home columns. How do I change the scraper to remove the (1), (2), etc. next to each team name?

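You can use str.replace with a pattern that matches the parentheses and the number(s) inside them, followed by str.strip, as it seems there is some whitespace at the beginning or the end:

# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['Away'] = df['Away'].str.replace(r'\(\d*\)', '', regex=True).str.strip()
df['Home'] = df['Home'].str.replace(r'\(\d*\)', '', regex=True).str.strip()
print(df.head())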
         Date      Time Away Home Network
0  12/25/2011     12 PM  BOS   NY     TNT
1  12/25/2011   2:30 PM  MIA  DAL     ABC
2  12/25/2011      5 PM  CHI  LAL     ABC
3  12/25/2011      8 PM  ORL  OKC    ESPN
4  12/25/2011  10:30 PM  LAC   GS    ESPN
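
To try the replacement without scraping the site, here is a minimal sketch on a few rows copied from the question's output. The padding around the values is an assumption about what splitting on 'at' leaves behind, and the name paren_number is only illustrative.

```python
import pandas as pd

# Sample rows based on the question's output; the surrounding spaces are
# an assumption about what splitting the Game column on 'at' leaves behind.
df = pd.DataFrame({
    "Away": ["BOS (1) ", "MIA (1) ", "UTA (1) "],
    "Home": [" NY (1)", " DAL (1)", " LAL (2)"],
})

# Matches a parenthesized run of digits such as "(1)" or "(12)".
paren_number = r"\(\d+\)"

for col in ["Away", "Home"]:
    # regex=True is passed explicitly so the pattern is not treated as
    # a literal string by newer pandas versions.
    df[col] = df[col].str.replace(paren_number, "", regex=True).str.strip()

print(df)
```

Using \d+ rather than \d* means an empty pair of parentheses would be left alone; which of the two is preferable depends on the data.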
import re
import numpy as np
import pandas as pd

dataset = pd.read_csv("Dataset.csv")
dataset.rename(columns={'Country(or dependent territory)': 'Country'}, inplace = True)
dataset.rename(columns={'% of worldpopulation': 'Percentage of World Population'}, inplace = True)
dataset.rename(columns={'Total Area': 'Total Area (km2)'}, inplace = True)

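You can add this code after splitting the Game column. It drops the last four characters of each value, so it assumes every value ends in a four-character suffix such as " (1)":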

df['Away']=df['Away'].astype(str).str[0:-4]
df['Home']=df['Home'].astype(str).str[0:-4]
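
The fixed-width slice depends on the exact length of the suffix. As a rough sketch of a variant that does not, the values can be cut at the first "(" instead. The sample values below are hypothetical (including the two-digit number) and are only there to show how the two approaches differ.

```python
import pandas as pd

# Hypothetical sample values: stray whitespace and one two-digit number.
teams = pd.Series(["BOS (1) ", " NY (1)", "LAL (10) "])

# Fixed-width slice from the answer: drops the last four characters.
sliced = teams.astype(str).str[0:-4]

# Variant: keep everything before the first "(" and trim whitespace.
cut = teams.str.split("(").str[0].str.strip()

print(sliced.tolist())
print(cut.tolist())
```

On this sample the slice leaves whitespace behind and, for the two-digit value, part of the parenthesis, while the cut at "(" returns the bare abbreviations.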

Instead of splitting the Game column at 'at', don't specify a delimiter at all. With no delimiter, .str.split() splits on whitespace, and you then only want the values at index 0 and index 3. So you really just need to change one line of code, from

df[['Away','Home']] = df.Game.str.split('at',expand=True)

to

df[['Away','Home']] = df.Game.str.split(expand=True)[[0,3]]

The full code then becomes:

import pandas as pd
import numpy as np

df = pd.read_html("https://www.sportsmediawatch.com/2011/12/revised-2011-12-nba-national-tv-schedule/", header=0)[0]

revisedCols = ['Date'] + [ col for col in df.columns if 'Revised' in col ]
df = df[revisedCols]

df.columns = df.iloc[0,:]

df = df.iloc[1:,:].reset_index(drop=True)


# Format Date to m/d/y
df['Date'] = np.where(df.Date.str.startswith(('10/', '11/', '12/')), df.Date + ' 11', df.Date + ' 12')
df['Date']=pd.to_datetime(df['Date'])
df['Date']=df['Date'].dt.strftime('%m/%d/%Y')

# Split the Game column
df[['Away','Home']] = df.Game.str.split(expand=True)[[0,3]]   


# Final dataframe with desired columns
df = df[['Date','Time','Away','Home','Net']]

df.columns = ['Date', 'Time', 'Away', 'Home', 'Network']

print(df)
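
To see why indexes 0 and 3 are the ones to keep, here is a small sketch on two Game values taken from the question's table. It assumes every Game value has the same five-token shape ("TEAM (n) at TEAM (n)"); a value with a different number of tokens would shift the columns.

```python
import pandas as pd

# Two Game values taken from the question's table.
games = pd.Series(["BOS (1) at NY (1)", "MIA (1) at DAL (1)"], name="Game")

# Splitting on whitespace gives five tokens per row:
# team, "(n)", "at", team, "(n)".
parts = games.str.split(expand=True)

# Columns 0 and 3 hold the team abbreviations.
teams = parts[[0, 3]].rename(columns={0: "Away", 3: "Home"})

print(parts)
print(teams)
```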
