
What happens when I modify a pandas DataFrame in the following way

I'm trying to understand this behavior: why it happens and, if it was intentional, what the motivation was for doing it this way.

So I create a DataFrame:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random((4,2)))


          0         1
0  0.548814  0.715189
1  0.602763  0.544883
2  0.423655  0.645894
3  0.437587  0.891773

and I can reference columns like so:

df.columns = ['a','b']
df.a
0    0.548814
1    0.602763
2    0.423655
3    0.437587
Name: a, dtype: float64

I can even make what I think is a new column:

df.third = pd.DataFrame(np.random.random((4,1)))

but df is still:

df
          a         b
0  0.548814  0.715189
1  0.602763  0.544883
2  0.423655  0.645894
3  0.437587  0.891773

However, df.third also exists (but I can't see it in my variable viewer in Spyder):

df.third
          0
0  0.118274
1  0.639921
2  0.143353
3  0.944669

If I wanted to add a third column, I'd have to do the following:

df['third'] = pd.DataFrame(np.random.random((4,1)))

          a         b     third
0  0.548814  0.715189  0.568045
1  0.602763  0.544883  0.925597
2  0.423655  0.645894  0.071036
3  0.437587  0.891773  0.087129

So, my question is: what's going on when I do df.third versus df['third']?

It added third as an attribute of the DataFrame object rather than as a column. You should stop accessing columns as attributes and always use df['third'] to avoid ambiguous behaviour.
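The ambiguity can be seen in a minimal sketch (the frame and values below are made up for illustration, not taken from the question): attribute assignment updates a column that already exists, but only sets a plain Python attribute when the name is new.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical frame used only for this demonstration.
df = pd.DataFrame(np.zeros((4, 2)), columns=["a", "b"])

# 'a' is already a column, so attribute assignment updates that column.
df.a = [1.0, 2.0, 3.0, 4.0]

# 'third' is not a column, so this only sets an attribute on the object.
# Newer pandas versions may emit a UserWarning here; it is silenced for the demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.third = [5.0, 6.0, 7.0, 8.0]

print(df["a"].tolist())   # the existing column was updated
print(list(df.columns))   # 'third' is not among the columns

# Bracket assignment always creates (or overwrites) a column.
df["third"] = [5.0, 6.0, 7.0, 8.0]
print(list(df.columns))
```

The same syntax does two different things depending on whether the name already exists as a column, which is why bracket access is the safer habit.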

You should get into the habit of always accessing and assigning columns using df[col_name]. This avoids problems like:

df.mean = some_calc()

The problem here is that mean is a method of DataFrame.

So you've overwritten a method with some computed value.
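A small sketch of that pitfall, using made-up data and a constant in place of some_calc() (both are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8.0).reshape(4, 2), columns=["a", "b"])

# Attribute assignment shadows the method on this one object.
df.mean = 0.5
print(callable(df.mean))     # df.mean is now just a number

# Bracket assignment on a fresh frame keeps the method and the column separate.
df2 = pd.DataFrame(np.arange(8.0).reshape(4, 2), columns=["a", "b"])
df2["mean"] = df2[["a", "b"]].mean(axis=1)
print(callable(df2.mean))    # still the DataFrame.mean method
print(df2["mean"].tolist())  # the column is reached with brackets
```

With bracket access a column may share its name with a method without either one hiding the other.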

The underlying issue is that attribute access was part of the design as a convenience, and the pandas for data analysis book and some early online video presentations showed it as a way of assigning a new column. The subtle errors it causes can be so pervasive that, IMO, it really should be banned and removed.

Seriously, I can't stress this enough: stop referring to columns as attributes. It's a serious bugbear of mine, and unfortunately I still see lots of answers posted showing this usage.

You can see that no new column is added:

In [97]:
df.third = pd.DataFrame(np.random.random((4,1)))
df.columns

Out[97]:
Index(['a', 'b'], dtype='object')

You can see that third was added as an attribute:

In [98]:
df.__dict__

Out[98]:
{'_data': BlockManager
 Items: Index(['a', 'b'], dtype='object')
 Axis 1: Int64Index([0, 1, 2, 3], dtype='int64')
 FloatBlock: slice(0, 2, 1), 2 x 4, dtype: float64,
 '_iloc': <pandas.core.indexing._iLocIndexer at 0x7e73b00>,
 '_item_cache': {},
 'is_copy': None,
 'third':           0
 0  0.844821
 1  0.286501
 2  0.459170
 3  0.243452}

You can see the usual internals such as _data (with its Items and Axis 1), _iloc and so on, but you also have 'third', which is a plain attribute.
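Since the internal names shown in that dump differ between pandas versions, a version-independent check may be more useful. The following is a minimal sketch, with the variable names in_columns and in_instance introduced here purely for illustration:

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 2)), columns=["a", "b"])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # newer pandas may warn about this assignment
    df.third = pd.DataFrame(np.random.random((4, 1)))

in_columns = "third" in df.columns   # is it a column?
in_instance = "third" in vars(df)    # is it a plain instance attribute?
print(in_columns, in_instance)
```

vars(df) returns the same dictionary as df.__dict__, so the second check looks exactly where the dump above shows 'third'.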

I think you added an attribute named third to the pandas DataFrame object. If you want to add a column named 'third', you have to do this:

df['third'] = pd.DataFrame(np.random.random((4,1)))
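As a side note, wrapping the values in a DataFrame is not required for a single column. A hedged sketch of two common alternatives, using made-up data and the hypothetical names fourth and df_with_fourth:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 2)), columns=["a", "b"])

# A 1-D array (or a Series) is enough for a single new column.
df["third"] = np.random.random(4)

# DataFrame.assign returns a new frame and leaves the original untouched.
df_with_fourth = df.assign(fourth=df["a"] + df["b"])

print(list(df.columns))
print(list(df_with_fourth.columns))
```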
