Best Practice for Adding Lots of Columns to Pandas DataFrame
I am trying to add many columns to a pandas DataFrame, as follows:
def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5'] +
    '''
    out_name = 'sum_' + col_name_base
    df[out_name] = 0.0
    for i in range(1, 6):
        col_name = col_name_base + str(i)
        if col_name in df:
            df[out_name] += df[col_name]
        else:
            logger.error('Col %s not in df' % col_name)

for col in sum_cols_list:
    create_sum_rounds(df, col)
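Here sum_cols_list is a list of about 200 base column names (e.g. "foo"), and df is a pandas DataFrame that contains each base column expanded with the suffixes 1 through 5 (e.g. "foo_1", "foo_2", ..., "foo_5").
When I run this snippet, I get a performance warning: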
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
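I believe this is because creating a new column actually calls an insert operation behind the scenes. What is the correct way to use pd.concat in this case?
Simplify :-)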
def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5'] +
    '''
    out_name = 'sum_' + col_name_base
    df[out_name] = df.loc[:, [x for x in df.columns if x.startswith(col_name_base)]].sum(axis=1)
Does this give you the result you expect?
import pandas as pd

df = pd.DataFrame({
    'Foo_1': [1, 2, 3, 4, 5],
    'Foo_2': [10, 20, 30, 40, 50],
    'Something': ['A', 'B', 'C', 'D', 'E']
})
# Select every column whose name contains 'Foo_' and sum across each row
df['Foo_Sum'] = df.filter(like='Foo_').sum(axis=1)
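One caveat about the snippet above: filter(like='Foo_') matches any column whose name contains that substring, so once Foo_Sum exists, a second run would fold it into the total. A minimal sketch of a stricter selection, assuming the base columns are named Foo_<number> (the anchored regex and the toy data are illustrative and not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Foo_1': [1, 2, 3],
    'Foo_2': [10, 20, 30],
    'Something': ['A', 'B', 'C'],
})

# Anchored pattern (an assumption about the naming scheme):
# matches Foo_1, Foo_2, ... but not Foo_Sum.
df['Foo_Sum'] = df.filter(regex=r'^Foo_\d+$').sum(axis=1)

# Re-running the anchored selection is safe because Foo_Sum is excluded.
rerun = df.filter(regex=r'^Foo_\d+$').sum(axis=1)

# The substring match, by contrast, now also picks up Foo_Sum.
loose = df.filter(like='Foo_').sum(axis=1)
```

Here rerun equals the original sum, while loose counts Foo_Sum a second time.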
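You can use the same approach, but instead of operating directly on the DataFrame, store each output as its own pd.Series. Then, once all the computations are done, use pd.concat to attach everything back to the original DataFrame.
(Untested, but it should work)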
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def create_sum_rounds(df, col_name_base):
    '''
    Return a summed Series built from base columns. For example,
    sum_foo = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
              df['foo_4'] + df['foo_5'] +
    '''
    # Accumulate into a standalone float Series instead of inserting into df
    out = pd.Series(0.0, name='sum_' + col_name_base, index=df.index)
    for i in range(1, 6):
        col_name = col_name_base + str(i)
        if col_name in df:
            out += df[col_name]
        else:
            logger.error('Col %s not in df' % col_name)
    # Return the result so the caller can collect it
    return out

col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))

# Attach all new columns with a single concat
new_df = pd.concat([df, *col_sums], axis=1)
Also, you can simplify the existing code (if you are willing to give up the logging):
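import pandas as pd

def create_sum_rounds(df, col_name_base):
    '''
    Return a summed Series built from base columns. For example,
    sum_foo = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
              df['foo_4'] + df['foo_5'] + ...
    '''
    # Raw f-string so \d reaches the regex engine unchanged;
    # anchored so only <base>_<digits> columns are selected
    summed = df.filter(regex=rf'^{col_name_base}_\d+$').sum(axis=1)
    # Name the Series so that concat produces a 'sum_<base>' column
    return summed.rename('sum_' + col_name_base)

col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))

# Attach all new columns with a single concat
new_df = pd.concat([df, *col_sums], axis=1)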
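The same idea can be written more compactly by building every new column in a dict first and concatenating once. This is a sketch under the assumption that the columns follow the <base>_<number> naming from the question; the toy data and the name sums are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'foo_1': [1, 2], 'foo_2': [10, 20],
    'bar_1': [5, 6], 'bar_2': [50, 60],
})
sum_cols_list = ['foo', 'bar']

# Build every new column first, keyed by its output name.
sums = {
    f'sum_{base}': df.filter(regex=rf'^{base}_\d+$').sum(axis=1)
    for base in sum_cols_list
}

# Then attach them all at once, which avoids repeated column inserts.
new_df = pd.concat([df, pd.DataFrame(sums)], axis=1)
```

Because the dict keys become the column names, no separate rename step is needed.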