So I need to assemble a dictionary of pandas Series, and I was wondering whether it would be faster to pass a reference to each Series instead of copying all the data into the dictionary. I have the code:
import pandas as pd

df = pd.read_csv('data.csv')
dict = {
    'Start' : df['Start']
}
print(dict.get('Start'))
To see whether the data was being copied, I tried changing the dataframe after building the dictionary:
dict = {
    'Start' : df['Start']
}
df['Start'] = df['End']
print(dict.get('Start'))
But this didn't change the output at all, which suggests that the dictionary contains a copy of the Series. I think that would be slower than just passing a reference, so is it possible to assign a reference as the value inside the dict?
df['Start'] = df['End']
is not a reliable way to test this. Pandas makes few, if any, guarantees about the underlying buffer that holds the data in a dataframe; all of this relies on implementation details. The block manager tries to keep things stored efficiently in like-typed blocks, which is possible if the dtypes are homogeneous, but in the case of heterogeneous dtypes
df['Start'] = df['End']
could rearrange the way the dataframe is represented internally.
A more reliable way to test the copying behavior is to modify a single element without changing the dtype of the column. So assuming "Start" is all integers:
>>> df = pd.DataFrame({"start":[1,2,3], "end":[4,5,6]})
>>> df
   start  end
0      1    4
1      2    5
2      3    6
>>> d = {'start':df['start']}
>>> df.loc[0, 'start'] = 99
>>> d
{'start': 0    99
1     2
2     3
Name: start, dtype: int64}
I'm not sure what guarantees pandas makes about df[column], but in my experience it always returns a view. However, it is a view of the underlying data in the block manager at that time. Mutating your dataframe can easily change that underlying buffer.
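If you want to check this on your own installation rather than take my word for it, here is a minimal sketch. It assumes a small all-integer frame like the one above; the variable names (same_object, shares_buffer, write_visible) are just illustrative. The identity check is plain Python behavior, so it should hold everywhere. The other two results depend on pandas internals and version: as I understand the docs, with Copy-on-Write enabled (described as the default from pandas 3.0) the buffer can start out shared while the later write is not visible through the dict.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"start": [1, 2, 3], "end": [4, 5, 6]})

s = df["start"]
d = {"start": s}

# A dict stores a reference to the Series object; nothing is duplicated here.
same_object = d["start"] is s

# Ask NumPy whether the Series in the dict and the dataframe column
# currently overlap in memory (an implementation detail, not a guarantee).
shares_buffer = np.shares_memory(d["start"].to_numpy(), df["start"].to_numpy())

# Write a single element without changing the dtype, then see whether
# the Series held by the dict observes the change. This is version-dependent.
df.loc[0, "start"] = 99
write_visible = bool(d["start"].iloc[0] == 99)

print("pandas", pd.__version__)
print("same object:", same_object)
print("shares buffer before write:", shares_buffer)
print("write visible through dict:", write_visible)
```

Either way, putting df['Start'] into the dict does not itself copy the data; whether later writes to the dataframe show up through the dict is the part that pandas does not promise.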