I am trying to download data from a website. The resulting table includes some rows that are not part of the data; they are easy to spot because their first column is not a number.
So I'm getting something like
GM_Num Date Tm
1 Monday, Apr 3 LAA
2 Tuesday, Apr 4 LAA
... ... ...
Gm# May Tm
where the last row is the kind I want to drop. In the actual table, several rows like this are scattered throughout.
Here is the code that I have tried so far to drop those rows:
import requests
import pandas as pd
url = 'https://www.baseball-reference.com/teams/LAA/2017-schedule-scores.shtml'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
df.rename(columns={"Gm#": "GM_Num"}, inplace = True)
#Attempts that didn't work:
df[df['GM_Num'].str.isdigit().isnull()]
#df[df.GM_Num.apply(lambda x: x.isnumeric())].set_index('GM_Num', inplace = True)
#df.set_index('GM_Num', inplace = True)
df
Thank you in advance for any help!
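Convert the 'Gm#' column to numeric with errors='coerce', which turns non-numeric values (such as the repeated header rows) into NaN, and then drop those rows. If you have already renamed the column, use 'GM_Num' in place of 'Gm#':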
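# Values that cannot be parsed as numbers (e.g. the repeated 'Gm#' header) become NaN
df['Gm#'] = pd.to_numeric(df['Gm#'], errors='coerce')
# Keep only the rows where a game number was parsed
df = df.dropna(subset=['Gm#'])
df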
Output:
Gm# Date Unnamed: 2 Tm Unnamed: 4 Opp W/L R RA \
0 1.0 Monday, Apr 3 boxscore LAA @ OAK L 2 4
1 2.0 Tuesday, Apr 4 boxscore LAA @ OAK W 7 6
2 3.0 Wednesday, Apr 5 boxscore LAA @ OAK W 5 0
3 4.0 Thursday, Apr 6 boxscore LAA @ OAK L 1 5
4 5.0 Friday, Apr 7 boxscore LAA NaN SEA W 5 1
.. ... ... ... ... ... ... ... .. ..
162 158.0 Wednesday, Sep 27 boxscore LAA @ CHW L-wo 4 6
163 159.0 Thursday, Sep 28 boxscore LAA @ CHW L 4 5
164 160.0 Friday, Sep 29 boxscore LAA NaN SEA W 6 5
165 161.0 Saturday, Sep 30 boxscore LAA NaN SEA L 4 6
167 162.0 Sunday, Oct 1 boxscore LAA NaN SEA W 6 2
Inn ... Rank GB Win Loss Save Time D/N \
0 NaN ... 3 1.0 Graveman Nolasco Casilla 2:56 N
1 NaN ... 2 1.0 Bailey Dull Bedrosian 3:17 N
2 NaN ... 2 1.0 Ramirez Cotton NaN 3:15 N
3 NaN ... 2 1.0 Triggs Skaggs NaN 2:44 D
4 NaN ... 1 Tied Chavez Gallardo NaN 2:56 N
.. ... ... ... ... ... ... ... ... ..
162 10 ... 2 20.0 Farquhar Parker NaN 3:58 N
163 NaN ... 2 21.0 Infante Chavez Minaya 3:04 N
164 NaN ... 2 21.0 Wood Rzepczynski Parker 3:01 N
165 NaN ... 2 21.0 Lawrence Bedrosian Diaz 3:32 N
167 NaN ... 2 21.0 Bridwell Simmons NaN 2:38 D
Attendance Streak Orig. Scheduled
0 36067 - NaN
1 11225 + NaN
2 13405 ++ NaN
3 13292 - NaN
4 43911 + NaN
.. ... ... ...
162 17012 - NaN
163 19596 -- NaN
164 35106 + NaN
165 38075 - NaN
167 34940 + NaN
[162 rows x 21 columns]
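The same idea can be tried offline on a small hand-made frame. The sketch below is only an illustration: the toy data stands in for the scraped table, and the names numeric, cleaned, mask and cleaned_alt are made up for this example. It also shows a boolean-mask variant close to the attempt in the question. That attempt likely failed because .str.isdigit() already returns True/False, so calling .isnull() on the result is False for every string value and the filter keeps no data rows.

```python
import pandas as pd

# Toy stand-in for the scraped table (assumed data, not the real page):
# repeated header rows are mixed in with the game rows.
df = pd.DataFrame({
    "Gm#": ["1", "2", "Gm#", "3", "Gm#", "4"],
    "Date": ["Monday, Apr 3", "Tuesday, Apr 4", "April",
             "Wednesday, Apr 5", "May", "Thursday, Apr 6"],
    "Tm": ["LAA", "LAA", "Tm", "LAA", "Tm", "LAA"],
})

# Option 1: coerce non-numeric values to NaN, then keep the rows that parsed.
numeric = pd.to_numeric(df["Gm#"], errors="coerce")
cleaned = df[numeric.notna()].copy()
# Casting to int removes the trailing ".0" that the NaN step introduces.
cleaned["Gm#"] = numeric[numeric.notna()].astype(int)

# Option 2: use the isdigit() result directly as the row mask.
mask = df["Gm#"].astype(str).str.isdigit()
cleaned_alt = df[mask].copy()
cleaned_alt["Gm#"] = cleaned_alt["Gm#"].astype(int)

print(cleaned)
```

Both variants keep the original row labels, so the index has gaps where header rows were removed; calling reset_index(drop=True) or set_index('Gm#') afterwards gives a clean index if one is needed.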