Python pandas: replace select values in groupby object
I have a large dataframe with individual-level data in four columns: a person id number, the year, her age, and her moving status. I use groupby on the person id number, stored in column unique_pid2.
import pandas as pd
gr_data = pd.read_csv("M:/test.csv").groupby('unique_pid2')
group = gr_data.get_group('5904_181')
print(group)
Each group looks like this:
        unique_pid2  year  age  moved
798908     5904_181  1983    0      0
798909     5904_181  1984    0      0
798910     5904_181  1985    0      0
798911     5904_181  1986    0      0
798912     5904_181  1987    2      5
798913     5904_181  1988    0      5
798914     5904_181  1989    0      0
798915     5904_181  1990    0      0
798916     5904_181  1991    0      0
798917     5904_181  1992    0      0
798918     5904_181  1993    0      0
798928     5904_181  2009   24      5
798929     5904_181  2011   26      1
For each group, I want to fill in values that are equal to zero in BOTH the moved and age columns with alternate values, but ONLY if these observations are "sandwiched" between other observations with at least one non-zero value in the age and moved columns.
For example, in the above group, I want to fill in rows 798914:798918, but not 798908:798911. For the observations that have both age and moved values equal to 0, I have written a function that replaces the zeros accordingly. But I want to call this function only on the "sandwich" cases like 798914:798918, and I don't know how to access those rows.
So far, I have tried something like:
group.loc[(group["age"] == 0) & (group["moved"] == 0), ['age', 'moved']] = someFunction(group)
But this also fills in the non-sandwiched observations, like the first four rows in the above group. How should I go about applying a function to fill in age and moved values equal to 0 in each group, but only for observations that are sandwiched between observations with non-zero values in either age, moved, or both?
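The over-selection described above can be reproduced on the sample group: the plain both-zero condition ignores row position, so it also matches the leading rows. A minimal sketch, with the age and moved columns typed in by hand as an assumption in place of the CSV file:

```python
import pandas as pd

# age and moved columns of the sample group, typed in by hand (illustrative)
group = pd.DataFrame({
    "age":   [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 24, 26],
    "moved": [0, 0, 0, 0, 5, 5, 0, 0, 0, 0, 0, 5, 1],
})

# True wherever both columns are zero, regardless of neighbouring rows
both_zero = (group["age"] == 0) & (group["moved"] == 0)

# Positions of the selected rows: the leading block is included as well
print(both_zero[both_zero].index.tolist())
```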
Assuming the values in age and moved are non-negative, you could select the desired rows using cumsum:
mask = ((grp['age'].cumsum() > 0) & (grp['moved'].cumsum() > 0)
        & (grp['age'] == 0) & (grp['moved'] == 0))
since a cumulative sum greater than 0 means there must have been a preceding positive value.
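To see how the cumulative sum acts as a "has a positive value appeared yet" flag, here is a minimal sketch on the age values of the sample group. The Series is typed in by hand rather than read from the CSV, an assumption made only to keep the snippet self-contained.

```python
import pandas as pd

# age column of the sample group, typed in by hand (illustrative only)
age = pd.Series([0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 24, 26])

# The running total stays at 0 until the first positive value appears,
# so "cumsum > 0" is True from that row onward.
seen_positive = age.cumsum() > 0

print(seen_positive.tolist())
```

Combining this flag with the both-zero condition leaves only the zero rows that come after a positive value.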
For example,
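import pandas as pd
df = pd.read_csv("M:/test.csv")
# group_keys=False keeps the original row index in the result of apply
gr_data = df.groupby('unique_pid2', group_keys=False)
def foo(grp):
    # Rows where both columns are zero and an earlier row in the group
    # already had a positive value in each column
    mask = ((grp['age'].cumsum() > 0) & (grp['moved'].cumsum() > 0)
            & (grp['age'] == 0) & (grp['moved'] == 0))
    # 'foo' is only a placeholder for the real replacement values; writing
    # a string into a numeric column may trigger a dtype warning in newer pandas
    grp.loc[mask, ['age', 'moved']] = 'foo'
    return grp
df = gr_data.apply(foo)
print(df)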
yields
    unique_pid2  year  age  moved
0      5904_181  1983    0      0
1      5904_181  1984    0      0
2      5904_181  1985    0      0
3      5904_181  1986    0      0
4      5904_181  1987    2      5
5      5904_181  1988    0      5
6      5904_181  1989  foo    foo
7      5904_181  1990  foo    foo
8      5904_181  1991  foo    foo
9      5904_181  1992  foo    foo
10     5904_181  1993  foo    foo
11     5904_181  2009   24      5
12     5904_181  2011   26      1
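Note that the cumsum mask only looks backwards, and it requires an earlier positive value in both columns. If the zero rows must also be followed by a non-zero observation, and a non-zero value in either column should count (as the question words it), one possible variant is sketched below. The sample frame (with two hypothetical trailing all-zero rows added), the helper name sandwich_mask and the fill value -1 are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample group from the question, typed in by hand, plus two trailing
# all-zero rows (hypothetical) to show that they are left alone.
df = pd.DataFrame({
    "unique_pid2": ["5904_181"] * 15,
    "year": [1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991,
             1992, 1993, 2009, 2011, 2012, 2013],
    "age":   [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 24, 26, 0, 0],
    "moved": [0, 0, 0, 0, 5, 5, 0, 0, 0, 0, 0, 5, 1, 0, 0],
})

def sandwich_mask(frame, key="unique_pid2", cols=("age", "moved")):
    """Flag all-zero rows that have a non-zero row both before and after
    them within the same group (assumes non-negative values)."""
    # 1 where at least one of the columns is non-zero, else 0
    nonzero = frame[list(cols)].ne(0).any(axis=1).astype(int)
    keys = frame[key]
    # Running count of non-zero rows seen so far in each group
    before = nonzero.groupby(keys).cumsum() > 0
    # Same count on the reversed rows, then restored to original order
    after = nonzero[::-1].groupby(keys[::-1]).cumsum()[::-1] > 0
    return (nonzero == 0) & before & after

mask = sandwich_mask(df)
df.loc[mask, ["age", "moved"]] = -1  # -1 is a placeholder fill value
print(df)
```

Because the mask is computed with grouped cumulative sums on the whole frame, this sketch does not need apply; the real replacement function could be called on df.loc[mask] instead of assigning a constant.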