Functions of pandas data frames and side effects
I want to write a function that takes a pandas data frame as input and returns only the rows whose average is greater than some specified threshold. The function works, but it has the side effect of changing the input, which I don't want.
def Remove_Low_Average(df, sample_names, average_threshold=30):
    data_frame = df
    data_frame['Mean'] = np.mean(data_frame[sample_names], axis=1)
    data_frame = data_frame[data_frame.Mean > average_threshold]
    return data_frame.reset_index(drop=True)
Example:
In [7]: junk_data = DataFrame(np.random.randn(5,5), columns=['a', 'b', 'c', 'd', 'e'])
In [8]: Remove_Low_Average(junk_data, ['a', 'b', 'c'], average_threshold=0)
In [9]: junk_data.columns
Out[9]: Index([u'a', u'b', u'c', u'd', u'e', u'Mean'], dtype='object')
So junk_data now has 'Mean' in its columns, even though the function never assigns to junk_data directly. I realize I could do this in a simpler manner, but this illustrates a problem I've been running into regularly, and I can't figure out why it happens. I figure this has to be a well-known thing, but I don't know how to stop this side effect from happening.
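A minimal sketch of what seems to be going on, assuming the cause is ordinary Python name binding: `data_frame = df` does not copy anything, it only gives the same object a second name, so a column added through either name shows up under both. The frame and variable names below are made up for illustration; `DataFrame.copy()` is the standard pandas call for getting an independent object.

```python
import numpy as np
import pandas as pd

# Hypothetical data, just to show the aliasing behaviour.
original = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['a', 'b'])

alias = original               # second name for the same object, no copy
independent = original.copy()  # separate object with its own data

# Adding a column through the alias also changes `original`.
alias['Mean'] = alias[['a', 'b']].mean(axis=1)

print(alias is original)                # True
print('Mean' in original.columns)       # True
print('Mean' in independent.columns)    # False
```

So one way to avoid the side effect is to work on `df.copy()` inside the function, at the cost of duplicating the data.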
EDIT: EdChum's link below answers the question.
You don't need to copy the old dataframe, just don't assign a new column :)
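def remove_low_average(df, sample_names, average_threshold=30):
    # keep the row means in a separate Series instead of adding a column to df
    mean = df[sample_names].mean(axis=1)
    # selecting with a boolean mask returns a new frame; df is left untouched
    return df.loc[mean > average_threshold]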
# then use it as:
df = remove_low_average(df, ['a', 'b'])
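As a quick check that this approach leaves the input alone, here is a small self-contained sketch. It assumes a recent pandas (so it selects with `.loc`), uses fixed made-up values instead of random data so the result is predictable, and adds `reset_index(drop=True)` only to mirror the function in the question.

```python
import pandas as pd

def remove_low_average(df, sample_names, average_threshold=30):
    # The row means live in their own Series; df is never written to.
    mean = df[sample_names].mean(axis=1)
    return df.loc[mean > average_threshold].reset_index(drop=True)

# Hypothetical data: row means over 'a' and 'b' are 15, 50 and 40.
junk_data = pd.DataFrame({'a': [10, 40, 50],
                          'b': [20, 60, 30],
                          'c': [0, 1, 2]})
columns_before = list(junk_data.columns)

filtered = remove_low_average(junk_data, ['a', 'b'])

print(list(junk_data.columns) == columns_before)  # True: no 'Mean' column added
print(len(junk_data), len(filtered))              # 3 2
```

The caller decides whether to rebind the name (`df = remove_low_average(df, ...)`) or keep both the original and the filtered frame.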