
What is a more efficient or "pythonic" way to clean column headers?

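I'm pulling some data from Pro Football Reference. All the information comes through fine, but the column headers are a bit messy. I wrote some code to clean them up, but it doesn't feel quite "right". It seems too repetitive, because I keep reassigning the same variable inside the same for loop.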

import pandas as pd
import re

# Pulling data
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
df = pd.read_html(url)[0]

# Cleaning column headers
col_headers = []
regex = re.compile('[()\'\':_0-9]')
for x in df.columns:
    y = (str(x).replace('1st', 'First'))
    y = (y.replace('%', 'Pct'))
    y = regex.sub('', y)
    y = y.strip('Unnamed level, ')
    col_headers.append(y)

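In the code above, I collect the cleaned column headers in a list, which I then use to reassign the column names. However, I don't feel I'm solving this efficiently, and I'd like to know whether anyone has suggestions for structuring this part of my code better.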

The column names are tuples, so take the second element of each one and apply the replacements to it:

Code:

cols = [x[1] for x in df.columns]
cols = list(map(lambda x: x.replace('1st', 'First'), cols))
cols = list(map(lambda x: x.replace('%', 'Pct'), cols))
print(cols)

Output:

['Rk', 'Tm', 'G', 'PF', 'Yds', 'Ply', 'Y/P', 'TO', 'FL', 'FirstD', 'Cmp', 'Att', 'Yds', 'TD', 'Int', 'NY/A', 'FirstD', 'Att', 'Yds', 'TD', 'Y/A', 'FirstD', 'Pen', 'Yds', 'FirstPy', 'ScPct', 'TOPct', 'EXP']
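
If more substitutions are needed later, the chain of map() calls grows by one line each time. A minimal sketch of a table-driven variant follows; the names REPLACEMENTS and clean are my own, and the three tuples are just a sample of the header shown in this thread, not the full table.

```python
# Replacement table; extend it instead of adding another map() call
REPLACEMENTS = {'1st': 'First', '%': 'Pct'}

def clean(name):
    """Apply every substitution in REPLACEMENTS to one header string."""
    for old, new in REPLACEMENTS.items():
        name = name.replace(old, new)
    return name

# A few (level 0, level 1) tuples taken from the header in this thread
columns = [
    ('Unnamed: 0_level_0', 'Rk'),
    ('Passing', '1stD'),
    ('Unnamed: 25_level_0', 'Sc%'),
]

# Keep only the second element of each tuple, then clean it
cols = [clean(sub) for _, sub in columns]
print(cols)  # ['Rk', 'FirstD', 'ScPct']
```

Note that keeping only the second level produces repeated names such as 'Yds' and 'FirstD', as the output above shows.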

You can use the .str methods on pd.Index:

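# .str is only available on a flat Index, so turn each MultiIndex tuple into a string first
df.columns = (df.columns.map(str)
                .str.replace('1st', 'First', regex=False)
                .str.replace('%', 'Pct', regex=False)
                # pandas 2.0+ treats the pattern literally unless regex=True is passed
                .str.replace(r"[()':_0-9]", '', regex=True)
                .str.strip('Unnamed level, '))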
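
The .str accessor exists only on a flat Index, not on a MultiIndex like the one in this thread, so the two levels have to be joined first. Here is a small sketch on a made-up frame; the data values are invented and the placeholder pattern is my assumption about how pandas names unnamed header levels, based on the labels shown in this thread.

```python
import pandas as pd

# Made-up two-level header mimicking the labels shown in this thread
columns = pd.MultiIndex.from_tuples([
    ('Unnamed: 0_level_0', 'Rk'),
    ('Passing', '1stD'),
    ('Unnamed: 25_level_0', 'Sc%'),
])
df = pd.DataFrame([[1, 20, 35.5]], columns=columns)

# Join each tuple into one string; the result is a flat Index with .str support
flat = df.columns.map(', '.join)

df.columns = (flat
              .str.replace('1st', 'First', regex=False)
              .str.replace('%', 'Pct', regex=False)
              # regex=True must be explicit in pandas 2.0 and later
              .str.replace(r'^Unnamed: \d+_level_\d+, ', '', regex=True))

print(list(df.columns))  # ['Rk', 'Passing, FirstD', 'ScPct']
```

Anchoring the pattern with ^ removes only a leading placeholder and leaves digits elsewhere in a label untouched.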

Convert the MultiIndex columns to a DataFrame and process them there.

import numpy as np

obj_col = pd.DataFrame(df.columns.tolist())

# blank out level-0 labels that are 'Unnamed:' placeholders
cond = obj_col[0].str.contains('Unnamed:')
obj_col['0_'] = np.where(cond, '', obj_col[0])

# special-case replacements on level 1
obj_col['1_'] = (obj_col[1].str.replace('1st', 'First')
                           .str.replace('%', 'Pct'))

# concat and strip
obj_col['col_name'] = ((obj_col['0_'] + ', ' + obj_col['1_'])
                        .str.strip(', '))

print(obj_col)

Output:

                     0      1            0_       1_            col_name
0    Unnamed: 0_level_0     Rk                     Rk                  Rk
1    Unnamed: 1_level_0     Tm                     Tm                  Tm
2    Unnamed: 2_level_0      G                      G                   G
3    Unnamed: 3_level_0     PF                     PF                  PF
4    Unnamed: 4_level_0    Yds                    Yds                 Yds
5          Tot Yds & TO    Ply  Tot Yds & TO      Ply   Tot Yds & TO, Ply
6          Tot Yds & TO    Y/P  Tot Yds & TO      Y/P   Tot Yds & TO, Y/P
7          Tot Yds & TO     TO  Tot Yds & TO       TO    Tot Yds & TO, TO
8    Unnamed: 8_level_0     FL                     FL                  FL
9    Unnamed: 9_level_0   1stD                 FirstD              FirstD
10              Passing    Cmp       Passing      Cmp        Passing, Cmp
11              Passing    Att       Passing      Att        Passing, Att
12              Passing    Yds       Passing      Yds        Passing, Yds
13              Passing     TD       Passing       TD         Passing, TD
14              Passing    Int       Passing      Int        Passing, Int
15              Passing   NY/A       Passing     NY/A       Passing, NY/A
16              Passing   1stD       Passing   FirstD     Passing, FirstD
17              Rushing    Att       Rushing      Att        Rushing, Att
18              Rushing    Yds       Rushing      Yds        Rushing, Yds
19              Rushing     TD       Rushing       TD         Rushing, TD
20              Rushing    Y/A       Rushing      Y/A        Rushing, Y/A
21              Rushing   1stD       Rushing   FirstD     Rushing, FirstD
22            Penalties    Pen     Penalties      Pen      Penalties, Pen
23            Penalties    Yds     Penalties      Yds      Penalties, Yds
24            Penalties  1stPy     Penalties  FirstPy  Penalties, FirstPy
25  Unnamed: 25_level_0    Sc%                  ScPct               ScPct
26  Unnamed: 26_level_0    TO%                  TOPct               TOPct
27  Unnamed: 27_level_0    EXP                    EXP                 EXP
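
The helper frame above stops at printing the new names. As a rough sketch, the same keep-or-drop logic can be written as a list comprehension and assigned straight back to the columns. The frame below is made up for illustration and its values are invented.

```python
import pandas as pd

# Made-up frame with the same kind of two-level header as the scraped table
columns = pd.MultiIndex.from_tuples([
    ('Unnamed: 1_level_0', 'Tm'),
    ('Rushing', 'Yds'),
    ('Penalties', 'Yds'),
])
df = pd.DataFrame([['Team A', 10, 20]], columns=columns)

# Keep the group name only when level 0 is a real label, not a placeholder
df.columns = [
    sub if top.startswith('Unnamed:') else f'{top}, {sub}'
    for top, sub in df.columns
]

print(list(df.columns))  # ['Tm', 'Rushing, Yds', 'Penalties, Yds']
```

Keeping the group name is what keeps the two 'Yds' columns distinguishable after flattening.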

Maybe something like this, although your example is fine too.

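regex = re.compile('[()\'\':_0-9]')

def clean_column_header(column):
    # Each column label is a tuple here, so convert it to a string first
    column = str(column).replace("1st", "First") \
                        .replace("%", "Pct")
    # Remove punctuation and digits before stripping, as in the question
    return regex.sub("", column).strip("Unnamed level, ")

# rename() maps each level of a MultiIndex separately, so assign a flat list instead
df.columns = [clean_column_header(column) for column in df.columns]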

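pd.DataFrame has a rename method that you can use directly in cases like this. Note that with MultiIndex columns, rename applies the function to the labels of each level separately, so the result keeps both levels: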

import pandas as pd
import re

# Pulling data
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
df = pd.read_html(url)[0]

regex = re.compile('[()\'\':_0-9]')
df.rename(columns = lambda x: regex.sub('',str(x).replace('1st', 'First').replace('%', 'Pct')).strip('Unnamed level, '), inplace=True)
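
One caveat applies to every snippet in this thread that calls strip('Unnamed level, '): str.strip treats its argument as a set of characters to remove from both ends, not as a prefix. A label that ends in one of those characters loses it, so 'Tm' becomes 'T' and 'Pen' becomes 'P'. Below is a sketch of a prefix-based variant. The function name clean_header is hypothetical, and str.removeprefix requires Python 3.9 or later.

```python
import re

regex = re.compile('[()\'\':_0-9]')

def clean_header(column):
    """Hypothetical variant of the cleaners above that removes a prefix, not a character set."""
    label = str(column).replace('1st', 'First').replace('%', 'Pct')
    label = regex.sub('', label)
    # removeprefix drops the exact leading substring and nothing else
    return label.removeprefix('Unnamed level, ')

# strip() removes any of the listed characters from both ends
print('Unnamed level, Tm'.strip('Unnamed level, '))  # T

print(clean_header(('Unnamed: 1_level_0', 'Tm')))    # Tm
print(clean_header(('Penalties', 'Pen')))            # Penalties, Pen
```

With a flat list of names from clean_header, df.columns = [clean_header(c) for c in df.columns] would replace the headers in one step.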
