Adding a column to a python pandas data frame based on the value of another column
I have a pandas data frame, and I would like to add a column that holds the row-to-row difference of one column, computed within groups defined by the value of a third column. Here is a toy example:
import pandas as pd
import numpy as np
d = {'one' : pd.Series(range(4), index=['a', 'b', 'c', 'd']),
     'two' : pd.Series(range(4), index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df['three'] = [2,2,3,3]
four = []
for i in set(df['three']):
    for j in range(len(df) -1):
        four.append(df[df['three'] == i]['two'][j + 1] - df[df['three']==i]['two'][j])
    four.append(0)
df['four'] = four
The final column should be [1, 1, 1, NaN], since that is the difference between each of the rows in the 'two' column.
This makes more sense in the context of my original code: my data frame is organized by some IDs, and then by time, so when I take the subset of the data frame for one ID, I'm left with the time series of the variables for that individual ID. However, I keep either getting a key error or ending up editing a copy of the original data frame. What is the right way to go about this?
You could replace df[df['three'] == i] with a groupby on column three. And perhaps replace ['two'][j + 1] - ['two'][j] with df['two'].shift(-1) - df['two'].
I think that would be identical to what you are doing now within the nested loop. How you implement this depends a bit on what format you want for the result. One way would be:
df.groupby('three').apply(lambda grp: pd.Series(grp['two'].shift(-1) - grp['two']))
Which would result in:
two    a    b
three
2      1  NaN
3      1  NaN
The column names become a bit meaningless after this operation.
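If the goal is a new column that lines up with the original rows, one option is to call shift on the grouped column directly, which keeps the original index. The following is a minimal sketch, assuming the toy frame from the question (rebuilt here in a shorter form) and assuming that the difference should restart for every value of 'three':

```python
import pandas as pd

# Toy frame from the question, rebuilt in a compact form (an assumption
# made for this sketch; the values match the original example).
df = pd.DataFrame(
    {"one": range(4), "two": range(4), "three": [2, 2, 3, 3]},
    index=["a", "b", "c", "d"],
)

# Shift 'two' up by one row within each 'three' group, then subtract.
# The last row of every group has no successor, so it becomes NaN.
df["four"] = df.groupby("three")["two"].shift(-1) - df["two"]
print(df)
```

Because the grouped shift returns a Series indexed like df, the assignment works on the original frame and no subset copy is involved.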
If all you want is the difference between consecutive rows of column two, use the shift method.
df['four'] = df.two.shift(-1) - df.two
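As a quick check of what this one-liner produces, here is a small sketch on the toy frame from the question (rebuilt here as an assumption). Note that it ignores column 'three' entirely, so the difference runs across group boundaries:

```python
import pandas as pd

# Toy data matching the question's example (rebuilt for this sketch).
df = pd.DataFrame({"two": range(4), "three": [2, 2, 3, 3]}, index=list("abcd"))

# Next row minus current row; the last row has no successor and becomes NaN.
df["four"] = df["two"].shift(-1) - df["two"]

# The same values can be obtained by negating a forward-looking diff.
alt = -df["two"].diff(-1)

print(df)
```

On this data the result matches the [1, 1, 1, NaN] column described in the question.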