How to rename a pandas dataframe in a memory-efficient way (without creating a copy)?
I want to rename a pandas dataframe df_old into df_new.
Since df.rename seems to be designed only for renaming individual series/columns within a given dataframe, I currently use the following approach:
df_new = df_old
del df_old
However, this is not memory efficient at all, since it creates a copy of df_old.
How can I rename a pandas dataframe in a more memory-efficient way, similar to inplace = True?
The right answer to the question
"How to rename a pandas dataframe in a more memory-efficient way, similar to inplace = True?"
is:
newName = oldName
is already a memory-efficient way of renaming.
Let's first summarize what follows:
There is no significant change in memory requirements due to df_new = df_old, because the assignment only binds a second name to the same object and copies no data.
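The summary can be checked directly without a profiler. The following is a minimal sketch, assuming pandas is installed; the column name "a" and the variable df_copy are made up for illustration. It contrasts plain assignment with an explicit copy:

```python
import pandas as pd

df_old = pd.DataFrame({"a": [1, 2, 3]})

# Plain assignment binds a second name to the same object; no data is copied.
df_new = df_old
same_object = df_new is df_old

# Deleting the old name leaves the data reachable through the new name.
del df_old
df_new.loc[0, "a"] = 10

# For contrast: .copy() really does create a second, independent object.
df_copy = df_new.copy()
df_copy.loc[0, "a"] = 99
copy_is_distinct = df_copy is not df_new

print(same_object, copy_is_distinct)
print(df_new.loc[0, "a"], df_copy.loc[0, "a"])
```

Only the .copy() call allocates new storage for the data, which is why the change to df_copy does not show up in df_new.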
There is a nice resource explaining all of this HERE, which says:
Python's memory management is so central to its behavior, not only do you not have to delete values, but there is no way to delete values. You may have seen the del statement:
nums = [1, 2, 3]
del nums
This does not delete the value nums, it deletes the name nums. The name is removed from its scope, and then the usual reference counting kicks in: if nums' value had only that one reference, then the value will be reclaimed. But if it had other references, then it will not.
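The behaviour described in the quote can be observed with a weak reference, which watches an object without keeping it alive. This is a small sketch in plain Python; the class Payload and all variable names are hypothetical, and the immediate reclamation after the last del is CPython's reference-counting behaviour, which other interpreters may not share:

```python
import weakref


class Payload:
    """Hypothetical stand-in for any large object, such as a dataframe."""


old_name = Payload()
probe = weakref.ref(old_name)  # observes the object without owning it

new_name = old_name            # second reference to the same value
del old_name                   # removes only the name old_name
alive_after_first_del = probe() is not None

del new_name                   # the last reference is gone
alive_after_second_del = probe() is not None

print(alive_after_first_del, alive_after_second_del)
```

The value survives the first del because new_name still refers to it, and is reclaimed only once no name refers to it any more.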
All of the voluminous material below only provides further proof of what was stated above.
See this code:
from memory_profiler import profile

@profile(precision=4)
def my_func():
    import pandas

    df_old = pandas.DataFrame([1,2,3,4,5])
    print(df_old)
    print(id(df_old))
    df_new = df_old
    print(id(df_new), id(df_old))
    del df_old

my_func()
On my machine it gives:
>python3.6 -u "renamePandas_Cg.py"
   0
0  1
1  2
2  3
3  4
4  5
140482968978768
140482968978768 140482968978768
Filename: renamePandas_Cg.py

Line #    Mem usage    Increment   Line Contents
================================================
     3  31.1680 MiB   0.0000 MiB   @profile(precision=4)
     4                             def my_func():
     5  64.1250 MiB  32.9570 MiB       import pandas
     6
     7  64.1953 MiB   0.0703 MiB       df_old = pandas.DataFrame([1,2,3,4,5])
     8  64.6680 MiB   0.4727 MiB       print(df_old)
     9  64.6680 MiB   0.0000 MiB       print(id(df_old))
    10  64.6680 MiB   0.0000 MiB       df_new = df_old
    11  64.6680 MiB   0.0000 MiB       print(id(df_new), id(df_old))
    12  64.6680 MiB   0.0000 MiB       del df_old
This shows that what is said in the comments is actually a fact: df_old and df_new both point to the same address in memory, and there is no increase in memory due to df_new = df_old.
Let's check whether the reported zero increase in memory is merely due to insufficient precision. Here is the result for precision=7:
>python3.6 -u "renamePandas_Cg.py"
   0
0  1
1  2
2  3
3  4
4  5
140698387071216
140698387071216 140698387071216
Filename: renamePandas_Cg.py

Line #       Mem usage       Increment   Line Contents
======================================================
     3  31.1718750 MiB   0.0000000 MiB   @profile(precision=7)
     4                                   def my_func():
     5  64.1992188 MiB  33.0273438 MiB       import pandas
     6
     7  64.3125000 MiB   0.1132812 MiB       df_old = pandas.DataFrame([1,2,3,4,5])
     8  64.7226562 MiB   0.4101562 MiB       print(df_old)
     9  64.7226562 MiB   0.0000000 MiB       print(id(df_old))
    10  64.7226562 MiB   0.0000000 MiB       df_new = df_old
    11  64.7226562 MiB   0.0000000 MiB       print(id(df_new), id(df_old))
    12  64.7226562 MiB   0.0000000 MiB       del df_old
Hmmm ... The memory increase for the import and for creating the dataframe is not the same as before, and it changes inconsistently from one run to another. The increment reported for df_new = df_old, however, is still zero.
By the way, if you still doubt the results because the dataframe is so small, change df_old = pandas.DataFrame([1,2,3,4,5]) to df_old = pandas.DataFrame(100000*[1,2,3,4,5]) and you will see the same results as before, except that the statement df_old = pandas.DataFrame(100000*[1,2,3,4,5]) consumes more than 7 MB of memory.
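As a cross-check that does not rely on process-level memory readings, the larger dataframe can also be examined with the standard-library tracemalloc module. This is a different technique from the memory_profiler runs above, offered only as a sketch: it assumes pandas is installed and that NumPy reports its buffer allocations to tracemalloc, and the names assign_bytes and copy_bytes are made up for illustration.

```python
import tracemalloc

import pandas as pd

df_old = pd.DataFrame(100000 * [1, 2, 3, 4, 5])

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()

df_new = df_old                      # rename by binding a second name
after_assign, _ = tracemalloc.get_traced_memory()

df_copy = df_old.copy()              # explicit copy, for contrast
after_copy, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

assign_bytes = after_assign - before        # expected to be negligible
copy_bytes = after_copy - after_assign      # expected to be megabytes

print(f"assignment: {assign_bytes} bytes, copy: {copy_bytes} bytes")
```

The exact numbers depend on the platform and library versions, but the assignment should account for at most a few bytes of bookkeeping, while the copy should account for roughly the size of the data.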