Functions of pandas data frames and side effects

Question

I want to write a function that takes a pandas data frame as input and returns only the rows whose average is greater than some specified threshold. The function works, but it has the side effect of changing the input, which I don't want.

def Remove_Low_Average(df, sample_names, average_threshold=30):
    data_frame = df
    data_frame['Mean'] = np.mean(data_frame[sample_names], axis=1)
    data_frame = data_frame[data_frame.Mean > average_threshold]
    return data_frame.reset_index(drop=True)

Example:

In [7]: junk_data = DataFrame(np.random.randn(5,5), columns=['a', 'b', 'c', 'd', 'e'])
In [8]: Remove_Low_Average(junk_data, ['a', 'b', 'c'], average_threshold=0)
In [9]: junk_data.columns
Out[9]: Index([u'a', u'b', u'c', u'd', u'e', u'Mean'], dtype='object')

So junk_data now has 'Mean' in its columns, even though the function only ever assigned that column to its local data_frame. I realize I could do this in a simpler manner, but it illustrates a problem I run into regularly, and I can't figure out why it happens. I figure this must be a well-known thing, but I don't know how to stop the side effect.

EDIT: EdChum's link below answers the question.

Answer 1

@EdChum answered this in the comments:

See this page. Basically, if you want to avoid modifying the original, make a deep copy by calling .copy().
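
The reason is that `data_frame = df` only binds a second name to the same object, so adding a column through either name changes the caller's frame. A minimal sketch of the copy-based fix follows. The name `remove_low_average_copy` is hypothetical, and the sketch assumes `numpy` and `pandas` are imported as `np` and `pd`:

```python
import numpy as np
import pandas as pd


def remove_low_average_copy(df, sample_names, average_threshold=30):
    # Work on a copy so the caller's frame is never modified.
    data_frame = df.copy()
    data_frame['Mean'] = data_frame[sample_names].mean(axis=1)
    data_frame = data_frame[data_frame['Mean'] > average_threshold]
    return data_frame.reset_index(drop=True)


junk_data = pd.DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'])
filtered = remove_low_average_copy(junk_data, ['a', 'b', 'c'], average_threshold=0)

print(list(junk_data.columns))     # the input still has only 'a' through 'e'
print('Mean' in filtered.columns)  # the helper column exists only on the result
```

The copy costs extra memory for large frames, which is the trade-off for keeping the 'Mean' column in the returned result.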

Answer 2

You don't need to copy the old dataframe, just don't assign a new column :)

def remove_low_average(df, sample_names, average_threshold=30):
    mean = df[sample_names].mean(axis=1)
    # .ix was removed in pandas 1.0; .loc accepts the same boolean mask
    return df.loc[mean > average_threshold]

# then use it as:
df = remove_low_average(df, ['a', 'b'])
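
A quick check that this approach leaves the input untouched, sketched on a small made-up frame (the `scores` data is illustrative only) and using boolean indexing with `.loc`:

```python
import pandas as pd


def remove_low_average(df, sample_names, average_threshold=30):
    # Row means over the chosen columns, computed without touching df.
    mean = df[sample_names].mean(axis=1)
    # The boolean mask selects the rows to keep; df itself is unchanged.
    return df.loc[mean > average_threshold]


# Made-up data for illustration only.
scores = pd.DataFrame({'a': [10, 40, 50], 'b': [20, 60, 30], 'c': [1, 2, 3]})
kept = remove_low_average(scores, ['a', 'b'])

print(list(scores.columns))  # ['a', 'b', 'c']
print(list(kept.index))      # [1, 2]
```

Unlike the function in the question, this keeps the original row labels. Chaining `.reset_index(drop=True)` onto the return value would renumber them from zero.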

solution1
0 ACCEPTED 2014-07-15 21:06:21

solution2
0 2014-07-15 23:04:30
