
Best Practice for Adding Lots of Columns to Pandas DataFrame

I am trying to add many columns to a pandas DataFrame, as follows:

import logging

logger = logging.getLogger(__name__)

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5']
    '''
    out_name = 'sum_' + col_name_base
    df[out_name] = 0.0  # each assignment like this inserts a new column
    for i in range(1, 6):
        col_name = f'{col_name_base}_{i}'  # e.g. 'foo_1'
        if col_name in df:
            df[out_name] += df[col_name]
        else:
            logger.error('Col %s not in df', col_name)

for col in sum_cols_list:
    create_sum_rounds(df, col)

Here sum_cols_list is a list of about 200 base column names (e.g. "foo"), and df is a pandas DataFrame that contains the base columns expanded from 1 to 5 (e.g. "foo_1", "foo_2", ..., "foo_5").

When I run this snippet, I get a performance warning:

PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

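I believe this is because creating a new column actually calls an insert operation behind the scenes. What is the correct way to use pd.concat in this case?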

Simplified :-)

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5']
    '''
    out_name = 'sum_' + col_name_base
    # Select every column whose name starts with the base prefix, e.g. 'foo_'
    base_cols = [x for x in df.columns if x.startswith(col_name_base + '_')]
    df[out_name] = df[base_cols].sum(axis=1)

Would this give you the result you expect?

import pandas as pd

df = pd.DataFrame({
    'Foo_1': [1, 2, 3, 4, 5],
    'Foo_2': [10, 20, 30, 40, 50],
    'Something': ['A', 'B', 'C', 'D', 'E']
})

# Sum, row by row, every column whose name contains 'Foo_'
df['Foo_Sum'] = df.filter(like='Foo_').sum(axis=1)
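One caveat worth illustrating: `filter(like=...)` keeps every label that contains the substring, so a column such as 'Foo_Sum' would itself be picked up if the line above were run a second time. The sketch below uses a made-up frame in which 'Foo_Sum' already exists (an assumption for illustration) and compares `like=` with an anchored `regex=`:

```python
import pandas as pd

df = pd.DataFrame({
    'Foo_1': [1, 2, 3],
    'Foo_2': [10, 20, 30],
    'Foo_Sum': [0, 0, 0],          # hypothetical leftover from an earlier run
    'Something': ['A', 'B', 'C'],
})

# like= matches the substring anywhere in the label, so 'Foo_Sum' is included
loose = df.filter(like='Foo_').columns.tolist()

# An anchored regex keeps only the numbered columns
strict = df.filter(regex=r'^Foo_\d+$').columns.tolist()

print(loose)
print(strict)
```

If the summed column is named with a different prefix (for example 'sum_Foo'), the `like=` version is unaffected by this.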

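You can use the same approach, but instead of operating directly on the DataFrame, store each output as its own pd.Series. Then, once all the calculations are done, use pd.concat to glue everything back onto the original DataFrame.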

(Untested, but should work)

import logging

import pandas as pd

logger = logging.getLogger(__name__)

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column from base columns in df. For example,
    sum_foo = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
              df['foo_4'] + df['foo_5']
    '''
    # Accumulate into a standalone Series so df itself is never modified
    out = pd.Series(0.0, name='sum_' + col_name_base, index=df.index)
    for i in range(1, 6):
        col_name = f'{col_name_base}_{i}'  # e.g. 'foo_1'
        if col_name in df:
            out += df[col_name]
        else:
            logger.error('Col %s not in df', col_name)
    return out

col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))
# Attach all new columns in one operation
new_df = pd.concat([df, *col_sums], axis=1)

In addition, you can simplify your existing code (if you are willing to give up the logging):

import pandas as pd

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column from base columns in df. For example,
    sum_foo = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
              df['foo_4'] + df['foo_5'] + ...
    '''
    # Raw f-string so \d reaches the regex engine; anchored to avoid partial matches
    base_cols = df.filter(regex=rf'^{col_name_base}_\d+$')
    # Name the result so the concatenated column is labelled 'sum_<base>'
    return base_cols.sum(axis=1).rename('sum_' + col_name_base)

col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))
new_df = pd.concat([df, *col_sums], axis=1)
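To make the concat approach easy to try, here is a minimal self-contained sketch on toy data. The base names 'foo' and 'bar' and the values are made up for illustration; the real sum_cols_list would hold the roughly 200 base names from the question.

```python
import pandas as pd

# Toy data: base names and values are assumptions for illustration only
sum_cols_list = ['foo', 'bar']
df = pd.DataFrame({
    f'{base}_{i}': [i, 2 * i, 3 * i]
    for base in sum_cols_list
    for i in range(1, 6)
})

# Build every summed column first, keyed by its output name
new_cols = {
    f'sum_{base}': df.filter(regex=rf'^{base}_\d+$').sum(axis=1)
    for base in sum_cols_list
}

# Attach them all with a single concat instead of one insert per column
new_df = pd.concat([df, pd.DataFrame(new_cols)], axis=1)

print(new_df[['sum_foo', 'sum_bar']])
```

Because the original df is only read from and the new columns arrive in one concat, the frame is not built up through repeated inserts, which is what the warning complains about.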
