What is a more efficient or "pythonic" way to clean column headers?
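I'm pulling some data from Pro Football Reference. All of the information comes through fine, but the column headers are a bit messy. I wrote some code to clean them up, but it doesn't feel quite "right". It seems too repetitive, since I keep reassigning the same variable inside the same for loop.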
import pandas as pd
import re
# Pulling data
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
df = pd.read_html(url)[0]
# Cleaning column headers
col_headers = []
regex = re.compile('[()\'\':_0-9]')
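for x in df.columns:
    # each x is a (level 0, level 1) tuple, so work on its string form
    y = str(x).replace('1st', 'First')
    y = y.replace('%', 'Pct')
    y = regex.sub('', y)
    # caution: strip() removes any of these characters from both ends,
    # not the literal prefix, so a name such as 'Tm' is cut to 'T'
    y = y.strip('Unnamed level, ')
    col_headers.append(y)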
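In the code above, I collect the cleaned column headers in a list, which I will then use to reassign the column names. However, I don't feel I'm solving this efficiently, and I'd like to know whether anyone has suggestions for structuring this part of my code better.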
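One detail worth checking when cleaning headers this way: str.strip treats its argument as a set of characters rather than a literal prefix, so it can also eat the end of a short name. The sketch below is one possible way to avoid that, and to cut down on the repeated reassignment by driving the substitutions from a small table. The names REPLACEMENTS, JUNK, PLACEHOLDER and clean_header, and the three sample labels, are illustrative assumptions and not part of the original code.

```python
import re

# Literal substitutions, applied in order; extend this table instead of adding lines.
REPLACEMENTS = [("1st", "First"), ("%", "Pct")]
# Characters to drop from the string form of a column label.
JUNK = re.compile(r"[()':_0-9]")
# What an unnamed top level looks like once JUNK has been removed (assumption).
PLACEHOLDER = "Unnamed level, "


def clean_header(raw):
    """Clean one header, given a column label such as a (level 0, level 1) tuple."""
    text = str(raw)
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    text = JUNK.sub("", text)
    # Remove the placeholder as a literal prefix rather than with strip().
    if text.startswith(PLACEHOLDER):
        text = text[len(PLACEHOLDER):]
    return text


sample = [
    ("Unnamed: 1_level_0", "Tm"),
    ("Passing", "1stD"),
    ("Unnamed: 25_level_0", "Sc%"),
]
print([clean_header(label) for label in sample])
# ['Tm', 'Passing, FirstD', 'ScPct']

# For comparison, strip() with the same text also removes the trailing 'm':
print("Unnamed level, Tm".strip("Unnamed level, "))
# T
```

With a DataFrame whose columns are such tuples, the result could be assigned with df.columns = [clean_header(c) for c in df.columns].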
The column names are tuples, so you can use the following:
Code:
cols = [x[1] for x in df.columns]
cols = list(map(lambda x: x.replace('1st', 'First'), cols))
cols = list(map(lambda x: x.replace('%', 'Pct'), cols))
print(cols)
Output:
['Rk', 'Tm', 'G', 'PF', 'Yds', 'Ply', 'Y/P', 'TO', 'FL', 'FirstD', 'Cmp', 'Att', 'Yds', 'TD', 'Int', 'NY/A', 'FirstD', 'Att', 'Yds', 'TD', 'Y/A', 'FirstD', 'Pen', 'Yds', 'FirstPy', 'ScPct', 'TOPct', 'EXP']
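Keeping only the second element of each tuple drops the group name, so some headers in the output above repeat. A quick way to see which ones, sketched here with the standard library on the list printed above (the variable names are only for illustration):

```python
from collections import Counter

# The cleaned headers exactly as printed in the output above.
cols = ['Rk', 'Tm', 'G', 'PF', 'Yds', 'Ply', 'Y/P', 'TO', 'FL', 'FirstD',
        'Cmp', 'Att', 'Yds', 'TD', 'Int', 'NY/A', 'FirstD', 'Att', 'Yds',
        'TD', 'Y/A', 'FirstD', 'Pen', 'Yds', 'FirstPy', 'ScPct', 'TOPct', 'EXP']

# Count each name and keep only those that occur more than once.
counts = Counter(cols)
duplicates = {name: n for name, n in counts.items() if n > 1}
print(duplicates)
# {'Yds': 4, 'FirstD': 3, 'Att': 2, 'TD': 2}
```

If unique column names matter for later selection, it may be preferable to keep the group name as part of the header.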
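You can use the .str methods on a pd.Index:
# .str is available on a flat Index only, so flatten the two-level columns first
df.columns = pd.Index([str(c) for c in df.columns])
df.columns = (df.columns
              .str.replace('1st', 'First')
              .str.replace('%', 'Pct')
              # regex=True is required for pattern replacement in pandas 2.0+
              .str.replace(r'[()\'\':_0-9]', '', regex=True)
              .str.strip('Unnamed level, '))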
Convert the multi-level column index to a DataFrame and process it.
import numpy as np
import pandas as pd

obj_col = pd.DataFrame(df.columns.tolist())
# replace level-0 names that contain 'Unnamed:' with ''
cond = obj_col[0].str.contains('Unnamed:')
obj_col['0_'] = np.where(cond, '', obj_col[0])
# special handling for level 1
obj_col['1_'] = (obj_col[1].str.replace('1st', 'First')
                           .str.replace('%', 'Pct'))
# concatenate and strip the leftover separator
obj_col['col_name'] = ((obj_col['0_'] + ', ' + obj_col['1_'])
                       .str.strip(', '))
print(obj_col)
Output:
                      0      1            0_       1_            col_name
 0   Unnamed: 0_level_0     Rk                     Rk                  Rk
 1   Unnamed: 1_level_0     Tm                     Tm                  Tm
 2   Unnamed: 2_level_0      G                      G                   G
 3   Unnamed: 3_level_0     PF                     PF                  PF
 4   Unnamed: 4_level_0    Yds                    Yds                 Yds
 5         Tot Yds & TO    Ply  Tot Yds & TO      Ply   Tot Yds & TO, Ply
 6         Tot Yds & TO    Y/P  Tot Yds & TO      Y/P   Tot Yds & TO, Y/P
 7         Tot Yds & TO     TO  Tot Yds & TO       TO    Tot Yds & TO, TO
 8   Unnamed: 8_level_0     FL                     FL                  FL
 9   Unnamed: 9_level_0   1stD                 FirstD              FirstD
10              Passing    Cmp       Passing      Cmp        Passing, Cmp
11              Passing    Att       Passing      Att        Passing, Att
12              Passing    Yds       Passing      Yds        Passing, Yds
13              Passing     TD       Passing       TD         Passing, TD
14              Passing    Int       Passing      Int        Passing, Int
15              Passing   NY/A       Passing     NY/A       Passing, NY/A
16              Passing   1stD       Passing   FirstD     Passing, FirstD
17              Rushing    Att       Rushing      Att        Rushing, Att
18              Rushing    Yds       Rushing      Yds        Rushing, Yds
19              Rushing     TD       Rushing       TD         Rushing, TD
20              Rushing    Y/A       Rushing      Y/A        Rushing, Y/A
21              Rushing   1stD       Rushing   FirstD     Rushing, FirstD
22            Penalties    Pen     Penalties      Pen      Penalties, Pen
23            Penalties    Yds     Penalties      Yds      Penalties, Yds
24            Penalties  1stPy     Penalties  FirstPy  Penalties, FirstPy
25  Unnamed: 25_level_0    Sc%                  ScPct               ScPct
26  Unnamed: 26_level_0    TO%                  TOPct               TOPct
27  Unnamed: 27_level_0    EXP                    EXP                 EXP
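The same join logic can also be sketched without an intermediate DataFrame, working directly on the (level 0, level 1) tuples. This is only an illustration: flatten_label is a hypothetical helper, the four sample labels are taken from the table above, and it checks for a leading 'Unnamed:' with startswith rather than str.contains.

```python
def flatten_label(label):
    """Join a two-level column label into one name, dropping unnamed groups."""
    top, bottom = label
    bottom = bottom.replace("1st", "First").replace("%", "Pct")
    # Placeholder group names produced for unnamed levels start with 'Unnamed:'.
    if top.startswith("Unnamed:"):
        return bottom
    return f"{top}, {bottom}"


labels = [
    ("Unnamed: 0_level_0", "Rk"),
    ("Tot Yds & TO", "Ply"),
    ("Passing", "1stD"),
    ("Unnamed: 25_level_0", "Sc%"),
]
flat = [flatten_label(label) for label in labels]
print(flat)
# ['Rk', 'Tot Yds & TO, Ply', 'Passing, FirstD', 'ScPct']
```

Assuming the DataFrame has exactly two column levels, the names could then be applied with df.columns = [flatten_label(c) for c in df.columns].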
Maybe something like this; your example is fine too.
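regex = re.compile('[()\'\':_0-9]')

def clean_column_header(column):
    # the column labels are tuples, so work on their string form
    column = str(column).replace("1st", "First").replace("%", "Pct")
    # drop the punctuation first so the placeholder text can be stripped
    return regex.sub("", column).strip("Unnamed level, ")

# assign the cleaned names directly as a flat list
df.columns = [clean_column_header(column) for column in df.columns]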
pd.DataFrame has a rename method that you can use directly for cases like this:
import pandas as pd
import re
# Pulling data
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
df = pd.read_html(url)[0]
regex = re.compile('[()\'\':_0-9]')
df.rename(columns = lambda x: regex.sub('',str(x).replace('1st', 'First').replace('%', 'Pct')).strip('Unnamed level, '), inplace=True)