What is a more efficient or "pythonic" way to clean column headers?

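I am pulling some data from the Pro Football Reference website. All of the information comes through fine, but the column headers are a bit messy. I wrote some code to clean them up, but it doesn't feel quite "right". It seems too repetitive, since I keep reassigning the same variable inside the same for loop.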

import pandas as pd
import re

# Pulling data
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
df = pd.read_html(url)[0]

# Cleaning column headers
col_headers = []
regex = re.compile('[()\'\':_0-9]')
for x in df.columns:
    y = (str(x).replace('1st', 'First'))
    y = (y.replace('%', 'Pct'))
    y = regex.sub('', y)
    y = y.strip('Unnamed level, ')
    col_headers.append(y)

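In the code above, I collect the desired column headers in a list, and I will then reassign the column names from that list. However, I feel I am not solving this efficiently, and I am wondering whether anyone has suggestions for structuring this part of my code better.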

The column names are tuples, so you can use the following:

Code:

cols = [x[1] for x in df.columns]
cols = list(map(lambda x: x.replace('1st', 'First'), cols))
cols = list(map(lambda x: x.replace('%', 'Pct'), cols))
print(cols)

Output:

['Rk', 'Tm', 'G', 'PF', 'Yds', 'Ply', 'Y/P', 'TO', 'FL', 'FirstD', 'Cmp', 'Att', 'Yds', 'TD', 'Int', 'NY/A', 'FirstD', 'Att', 'Yds', 'TD', 'Y/A', 'FirstD', 'Pen', 'Yds', 'FirstPy', 'ScPct', 'TOPct', 'EXP']
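
Note that the level-1 labels on their own are not unique: 'Yds', 'Att', 'TD' and 'FirstD' each appear more than once in the output above. If unique names matter, one option is to keep the group name from level 0 whenever it is not one of the 'Unnamed: ...' placeholders. The following is only a minimal sketch, assuming every label is a 2-tuple as in this table; the sample list and the `flatten` helper are hypothetical, and the underscore separator is an arbitrary choice.

```python
# Hypothetical sample of the two-level column labels (group, name)
columns = [
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 4_level_0", "Yds"),
    ("Passing", "Yds"),
    ("Rushing", "1stD"),
    ("Unnamed: 25_level_0", "Sc%"),
]

def flatten(label):
    """Join group and stat name, skipping the 'Unnamed: ...' placeholders."""
    group, name = label
    name = name.replace("1st", "First").replace("%", "Pct")
    if group.startswith("Unnamed:"):
        return name
    return f"{group}_{name}"

cols = [flatten(label) for label in columns]
print(cols)
# ['Rk', 'Yds', 'Passing_Yds', 'Rushing_FirstD', 'ScPct']
```

With the real table, the same list comprehension could be run over df.columns and the result assigned back to df.columns.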

You can use the .str methods on a pd.Index:

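# .str works on a flat Index; for MultiIndex columns, select a level first,
# e.g. df.columns.get_level_values(1)
df.columns = (df.columns
                .str.replace('1st', 'First')
                .str.replace('%', 'Pct')
                .str.replace(r'[()\'\':_0-9]', '', regex=True)  # pattern replace needs regex=True in pandas 2.x
                .str.strip('Unnamed level, '))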
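
Because read_html returns a two-level header for this table, the .str accessor cannot be called on df.columns directly. A possible way to try the chained approach is to work on a single level, as sketched below. The small frame is a hypothetical stand-in for the scraped table (so no network access is needed), and taking only level 1 is an assumption about which labels you want to keep.

```python
import pandas as pd

# Hypothetical stand-in for the scraped two-level header
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Passing", "1stD"),
    ("Unnamed: 25_level_0", "Sc%"),
])
df = pd.DataFrame([[1, 20, 35.5]], columns=columns)

# Take one level as a flat Index, where the .str accessor is available
level1 = df.columns.get_level_values(1)

cleaned = (level1
           .str.replace("1st", "First", regex=False)     # literal replacement
           .str.replace("%", "Pct", regex=False)         # literal replacement
           .str.replace(r"[()':_0-9]", "", regex=True))  # pattern replacement

df.columns = cleaned
print(list(df.columns))
# ['Rk', 'FirstD', 'ScPct']
```

Passing regex explicitly keeps the behavior the same across pandas versions, since the default for str.replace changed in pandas 2.0.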

Convert the MultiIndex columns to a DataFrame and process them there.

import numpy as np

obj_col = pd.DataFrame(df.columns.tolist())

# blank out level-0 labels that contain 'Unnamed:'
cond = obj_col[0].str.contains('Unnamed:')
obj_col['0_'] = np.where(cond, '', obj_col[0])

# special handling for level 1
obj_col['1_'] = (obj_col[1].str.replace('1st', 'First')
                           .str.replace('%', 'Pct'))

# concat and strip
obj_col['col_name'] = ((obj_col['0_'] + ', ' + obj_col['1_'])
                        .str.strip(', '))

print(obj_col)

Output:

                     0      1            0_       1_            col_name
0    Unnamed: 0_level_0     Rk                     Rk                  Rk
1    Unnamed: 1_level_0     Tm                     Tm                  Tm
2    Unnamed: 2_level_0      G                      G                   G
3    Unnamed: 3_level_0     PF                     PF                  PF
4    Unnamed: 4_level_0    Yds                    Yds                 Yds
5          Tot Yds & TO    Ply  Tot Yds & TO      Ply   Tot Yds & TO, Ply
6          Tot Yds & TO    Y/P  Tot Yds & TO      Y/P   Tot Yds & TO, Y/P
7          Tot Yds & TO     TO  Tot Yds & TO       TO    Tot Yds & TO, TO
8    Unnamed: 8_level_0     FL                     FL                  FL
9    Unnamed: 9_level_0   1stD                 FirstD              FirstD
10              Passing    Cmp       Passing      Cmp        Passing, Cmp
11              Passing    Att       Passing      Att        Passing, Att
12              Passing    Yds       Passing      Yds        Passing, Yds
13              Passing     TD       Passing       TD         Passing, TD
14              Passing    Int       Passing      Int        Passing, Int
15              Passing   NY/A       Passing     NY/A       Passing, NY/A
16              Passing   1stD       Passing   FirstD     Passing, FirstD
17              Rushing    Att       Rushing      Att        Rushing, Att
18              Rushing    Yds       Rushing      Yds        Rushing, Yds
19              Rushing     TD       Rushing       TD         Rushing, TD
20              Rushing    Y/A       Rushing      Y/A        Rushing, Y/A
21              Rushing   1stD       Rushing   FirstD     Rushing, FirstD
22            Penalties    Pen     Penalties      Pen      Penalties, Pen
23            Penalties    Yds     Penalties      Yds      Penalties, Yds
24            Penalties  1stPy     Penalties  FirstPy  Penalties, FirstPy
25  Unnamed: 25_level_0    Sc%                  ScPct               ScPct
26  Unnamed: 26_level_0    TO%                  TOPct               TOPct
27  Unnamed: 27_level_0    EXP                    EXP                 EXP

Maybe something like this, although your example is fine too.

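def clean_column_header(column):

    regex = re.compile('[()\'\':_0-9]')

    # The labels here are tuples, so convert to str before calling str methods
    column = str(column).replace("1st", "First") \
                        .replace("%", "Pct")

    # Substitute first, then strip, matching the order in the question's loop
    return regex.sub("", column).strip("Unnamed level, ")

# rename() maps each level of a MultiIndex separately, so assign the flat names directly
df.columns = [clean_column_header(column) for column in df.columns]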

pd.DataFrame has a rename method that you can use directly in cases like this:

import pandas as pd
import re

# Pulling data
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
df = pd.read_html(url)[0]

regex = re.compile('[()\'\':_0-9]')
df.rename(columns = lambda x: regex.sub('',str(x).replace('1st', 'First').replace('%', 'Pct')).strip('Unnamed level, '), inplace=True)
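
One thing to watch in the snippets that call strip('Unnamed level, '): str.strip treats its argument as a set of characters to remove from both ends, not as a literal prefix, so it can also eat the end of a legitimate name. The sketch below shows the effect on two strings of the shape produced by the cleaning loop, together with one possible alternative. The `drop_prefix` helper is hypothetical and assumes the placeholder text is exactly 'Unnamed level, '.

```python
import re

chars = "Unnamed level, "

# strip() removes any of these characters from both ends of the string
tm = "Unnamed level, Tm".strip(chars)
pen = "Penalties, Pen".strip(chars)
print(tm)    # T             (the trailing 'm' is in the character set)
print(pen)   # Penalties, P  (the trailing 'e' and 'n' are in the set)

# Hypothetical alternative: remove the placeholder only when it is a real prefix
prefix = re.compile(r"^Unnamed level, ")

def drop_prefix(name):
    return prefix.sub("", name)

print(drop_prefix("Unnamed level, Tm"))   # Tm
print(drop_prefix("Penalties, Pen"))      # Penalties, Pen
```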
