What is a more efficient or "pythonic" way to clean column headers?
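I'm pulling some data from Pro Football Reference. All the information comes in fine, but the column headers are a bit messy. I wrote some code to clean them up, but it doesn't feel quite "right". It seems too repetitive, since I keep reassigning the same variable inside the same for loop.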
import pandas as pd
import re
# Pulling data
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
df = pd.read_html(url)[0]
# Cleaning column headers
col_headers = []
regex = re.compile('[()\'\':_0-9]')
for x in df.columns:
    y = (str(x).replace('1st', 'First'))
    y = (y.replace('%', 'Pct'))
    y = regex.sub('', y)
    y = y.strip('Unnamed level, ')
    col_headers.append(y)
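In the code above, I collect the desired column header output in a list, and I will then reassign the column names from that list. However, I feel I'm not solving this efficiently, and I'm wondering whether anyone has suggestions on how to structure this part of my code better.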
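One thing worth checking in the loop above: str.strip treats its argument as a set of characters to remove from both ends, not as a literal prefix, so it can also eat trailing letters of a real column name. A minimal sketch of the effect, using a hypothetical already-cleaned label (the sample string and variable names are assumptions for illustration):

```python
# Hypothetical label as it might look after the regex substitution.
label = "Unnamed level, Tm"
prefix = "Unnamed level, "

# str.strip removes any leading/trailing characters contained in the argument.
# The trailing "m" is part of that character set, so it is removed as well.
stripped = label.strip(prefix)

# Removing the literal prefix instead leaves the real name intact.
prefix_removed = label[len(prefix):] if label.startswith(prefix) else label

print(repr(stripped))        # 'T'
print(repr(prefix_removed))  # 'Tm'
```

Names made only of characters outside that set (such as 'Rk' or 'TD') are unaffected, which makes the problem easy to miss.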
The column names are tuples, so you can use the following:
Code:
cols = [x[1] for x in df.columns]
cols = list(map(lambda x: x.replace('1st', 'First'), cols))
cols = list(map(lambda x: x.replace('%', 'Pct'), cols))
print(cols)
Output:
['Rk', 'Tm', 'G', 'PF', 'Yds', 'Ply', 'Y/P', 'TO', 'FL', 'FirstD', 'Cmp', 'Att', 'Yds', 'TD', 'Int', 'NY/A', 'FirstD', 'Att', 'Yds', 'TD', 'Y/A', 'FirstD', 'Pen', 'Yds', 'FirstPy', 'ScPct', 'TOPct', 'EXP']
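Note that keeping only the second level drops the group names, so some names in the output repeat (for example 'Yds' and 'Att'). A rough sketch for spotting such repeats, using a shortened, hand-picked subset of the list above rather than the full data:

```python
from collections import Counter

# Shortened subset of the names printed above, for illustration only.
cols = ['Rk', 'Tm', 'Yds', 'Cmp', 'Att', 'Yds', 'TD', 'Att', 'Yds', 'TD']

# Count each name and keep the ones that occur more than once.
counts = Counter(cols)
duplicates = sorted(name for name, n in counts.items() if n > 1)

print(duplicates)  # ['Att', 'TD', 'Yds']
```

If unique names matter for later column selection, keeping part of the first level in the name is one way around this.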
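You can use the .str methods on a pd.Index:
# Note: .str is only available on a flat Index of strings, not on a MultiIndex
df.columns = (df.columns
              .str.replace('1st', 'First', regex=False)
              .str.replace('%', 'Pct', regex=False)
              # pandas 2.0 and later treat the pattern literally unless regex=True
              .str.replace(r'[()\'\':_0-9]', '', regex=True)
              # remove the literal prefix; str.strip would treat it as a character set
              .str.replace('Unnamed level, ', '', regex=False))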
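The .str accessor works on a flat Index of strings; it is not available on a MultiIndex. A minimal sketch under that assumption, using a small hand-made MultiIndex (a hypothetical stand-in for the scraped header) and keeping only its second level via get_level_values:

```python
import pandas as pd

# Toy two-level header, a hypothetical stand-in for what read_html returns here.
columns = pd.MultiIndex.from_tuples([
    ('Unnamed: 0_level_0', 'Rk'),
    ('Passing', '1stD'),
    ('Unnamed: 25_level_0', 'Sc%'),
])

# The second level is a plain Index of strings, so .str is available on it.
flat = columns.get_level_values(1)

cleaned = (flat
           .str.replace('1st', 'First', regex=False)
           .str.replace('%', 'Pct', regex=False))

print(list(cleaned))  # ['Rk', 'FirstD', 'ScPct']
```

This discards the group names in level 0, so it suits the case where only the lower-level names are wanted.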
Convert the MultiIndex columns to a DataFrame and process them there.
import numpy as np

obj_col = pd.DataFrame(df.columns.tolist())
# replace 'Unnamed:' labels in column level 0 with ''
cond = obj_col[0].str.contains('Unnamed:')
obj_col['0_'] = np.where(cond, '', obj_col[0])
# special handling for level 1
obj_col['1_'] = (obj_col[1].str.replace('1st', 'First')
                           .str.replace('%', 'Pct'))
# concatenate and strip the leftover separator
obj_col['col_name'] = ((obj_col['0_'] + ', ' + obj_col['1_'])
                       .str.strip(', '))
print(obj_col)
Output:
                      0      1            0_       1_            col_name
 0   Unnamed: 0_level_0     Rk                     Rk                  Rk
 1   Unnamed: 1_level_0     Tm                     Tm                  Tm
 2   Unnamed: 2_level_0      G                      G                   G
 3   Unnamed: 3_level_0     PF                     PF                  PF
 4   Unnamed: 4_level_0    Yds                    Yds                 Yds
 5         Tot Yds & TO    Ply  Tot Yds & TO      Ply   Tot Yds & TO, Ply
 6         Tot Yds & TO    Y/P  Tot Yds & TO      Y/P   Tot Yds & TO, Y/P
 7         Tot Yds & TO     TO  Tot Yds & TO       TO    Tot Yds & TO, TO
 8   Unnamed: 8_level_0     FL                     FL                  FL
 9   Unnamed: 9_level_0   1stD                 FirstD              FirstD
10              Passing    Cmp       Passing      Cmp        Passing, Cmp
11              Passing    Att       Passing      Att        Passing, Att
12              Passing    Yds       Passing      Yds        Passing, Yds
13              Passing     TD       Passing       TD         Passing, TD
14              Passing    Int       Passing      Int        Passing, Int
15              Passing   NY/A       Passing     NY/A       Passing, NY/A
16              Passing   1stD       Passing   FirstD     Passing, FirstD
17              Rushing    Att       Rushing      Att        Rushing, Att
18              Rushing    Yds       Rushing      Yds        Rushing, Yds
19              Rushing     TD       Rushing       TD         Rushing, TD
20              Rushing    Y/A       Rushing      Y/A        Rushing, Y/A
21              Rushing   1stD       Rushing   FirstD     Rushing, FirstD
22            Penalties    Pen     Penalties      Pen      Penalties, Pen
23            Penalties    Yds     Penalties      Yds      Penalties, Yds
24            Penalties  1stPy     Penalties  FirstPy  Penalties, FirstPy
25  Unnamed: 25_level_0    Sc%                  ScPct               ScPct
26  Unnamed: 26_level_0    TO%                  TOPct               TOPct
27  Unnamed: 27_level_0    EXP                    EXP                 EXP
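To actually use the names from the col_name column, they can be assigned back to the frame. A minimal sketch on a toy three-column frame (the data values are made up, and Series.where is used here in place of np.where purely to keep the example free of a NumPy import):

```python
import pandas as pd

# Toy frame with a two-level header; the values are invented for illustration.
df = pd.DataFrame(
    [[1, 20, 3]],
    columns=pd.MultiIndex.from_tuples([
        ('Unnamed: 0_level_0', 'Rk'),
        ('Passing', 'Cmp'),
        ('Passing', '1stD'),
    ]),
)

obj_col = pd.DataFrame(df.columns.tolist())

# Blank out placeholder labels in level 0, keep real group names.
level0 = obj_col[0].where(~obj_col[0].str.contains('Unnamed:'), '')
level1 = obj_col[1].str.replace('1st', 'First', regex=False)

# Join both levels; strip the ', ' left over when level 0 was blank.
obj_col['col_name'] = (level0 + ', ' + level1).str.strip(', ')

# Assign the flat names back to the frame.
df.columns = obj_col['col_name'].tolist()
print(df.columns.tolist())  # ['Rk', 'Passing, Cmp', 'Passing, FirstD']
```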
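Maybe something like this, although your example is fine too.
def clean_column_header(column):
    regex = re.compile('[()\'\':_0-9]')
    # str() handles tuple labels; substitute first, then drop unwanted characters
    column = regex.sub("", str(column).replace("1st", "First")
                                      .replace("%", "Pct"))
    # remove the literal placeholder; str.strip would treat it as a character set
    return column.replace("Unnamed level", "").strip(", ")

# assigning a list works whether the columns are flat or a MultiIndex
df.columns = [clean_column_header(column) for column in df.columns]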
pd.DataFrame has a rename method that you can use directly for cases like this:
import pandas as pd
import re

# Pulling data
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
df = pd.read_html(url)[0]

regex = re.compile('[()\'\':_0-9]')
# str.replace removes the literal placeholder; str.strip would treat it as a character set
df.rename(columns=lambda x: regex.sub('', str(x).replace('1st', 'First').replace('%', 'Pct'))
                                 .replace('Unnamed level', '').strip(', '),
          inplace=True)
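If the main concern is reassigning the same variable over and over, the replacements can be kept in a list and applied in one pass. This is only a sketch: the names REPLACEMENTS, DROP_CHARS, UNNAMED, and clean_header are hypothetical, and the sample tuples are a hand-picked subset of the headers.

```python
import re
from functools import reduce

# Literal substitutions, applied in order (hypothetical helper names).
REPLACEMENTS = [('1st', 'First'), ('%', 'Pct')]
# Characters to drop: the same character class used in the question.
DROP_CHARS = re.compile(r"[()':_0-9]")
# What 'Unnamed: N_level_0' turns into once DROP_CHARS has been applied.
UNNAMED = 'Unnamed level'


def clean_header(raw):
    """Return a cleaned header string for a tuple or string label."""
    text = reduce(lambda s, pair: s.replace(*pair), REPLACEMENTS, str(raw))
    text = DROP_CHARS.sub('', text)
    # Remove the literal placeholder, then any leftover ', ' separator.
    return text.replace(UNNAMED, '').strip(', ')


# Hand-picked sample of header tuples, for illustration only.
headers = [
    ('Unnamed: 0_level_0', 'Tm'),
    ('Passing', '1stD'),
    ('Unnamed: 25_level_0', 'Sc%'),
]
cleaned = [clean_header(h) for h in headers]
print(cleaned)  # ['Tm', 'Passing, FirstD', 'ScPct']
```

Adding another substitution then means adding one pair to REPLACEMENTS rather than another line to the loop.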