
Python pandas: replace select values in groupby object

I have a large dataframe with individual-level data in four columns: a person id number, the year, her age, and her moving status. I use groupby on the person id number, stored in column unique_pid2.

import pandas as pd 

gr_data = pd.read_csv("M:/test.csv").groupby('unique_pid2')

group = gr_data.get_group('5904_181')

print(group)

Each group looks like this:

       unique_pid2  year  age  moved
798908    5904_181  1983    0      0
798909    5904_181  1984    0      0
798910    5904_181  1985    0      0
798911    5904_181  1986    0      0
798912    5904_181  1987    2      5
798913    5904_181  1988    0      5
798914    5904_181  1989    0      0
798915    5904_181  1990    0      0
798916    5904_181  1991    0      0
798917    5904_181  1992    0      0
798918    5904_181  1993    0      0
798928    5904_181  2009   24      5
798929    5904_181  2011   26      1

For each group, I want to fill in values that are equal to zero in BOTH the moved and age columns with alternate values, but ONLY if these observations are "sandwiched" between other observations with at least one non-zero value in the age and moved columns.

For example, in the above group, I want to fill in lines 798914:798918, but not 798908:798911. For the observations that have both age and moved values equal to 0, I have written a function that replaces the zeros accordingly. But I want to call this function only on the "sandwich" cases like 798914:798918, and I don't know how to access those rows.

So far, I have tried something like:

group.loc[(group["age"] == 0) & (group["moved"] == 0), ['age', 'moved']] = someFunction(group)

But this fills in the non-sandwiched observations, like the first four rows in the above group. How should I go about applying a function to fill in age and moved values equal to 0 in each group, but only for observations that are sandwiched between observations with non-zero values in either age, moved, or both?

Assuming the values in age and moved are non-negative, you could select the desired rows using cumsum:

mask = ((grp['age'].cumsum()>0) & (grp['moved'].cumsum()>0)
        & (grp['age'] == 0) & (grp['moved'] == 0))

since, when the cumulative sum is greater than 0 on a row whose own value is 0, there must have been a preceding positive value.
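Broken into steps, the mask can be sketched as follows. This is only an illustration: the age and moved values are copied from the example group in the question, and the intermediate names (seen_age, seen_moved, both_zero, selected) are introduced here and are not part of the original answer.

```python
import pandas as pd

# age and moved values copied from the example group in the question
grp = pd.DataFrame({
    "age":   [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 24, 26],
    "moved": [0, 0, 0, 0, 5, 5, 0, 0, 0, 0, 0, 5, 1],
})

# The running totals stay at 0 until the first positive value appears.
seen_age = grp["age"].cumsum() > 0
seen_moved = grp["moved"].cumsum() > 0

# Rows where both columns are currently zero.
both_zero = (grp["age"] == 0) & (grp["moved"] == 0)

mask = seen_age & seen_moved & both_zero
selected = mask[mask].index.tolist()
print(selected)  # prints [6, 7, 8, 9, 10]
```

Positions 6 to 10 are the rows for 1989 through 1993; the four leading zero rows are excluded because both running totals are still 0 there.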

For example, applied to each group with groupby:

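import pandas as pd

df = pd.read_csv("M:/test.csv")
# group_keys=False keeps the original row index in the result
gr_data = df.groupby('unique_pid2', group_keys=False)

def foo(grp):
    # zero rows that come after a positive value in both columns
    mask = ((grp['age'].cumsum()>0) & (grp['moved'].cumsum()>0)
            & (grp['age'] == 0) & (grp['moved'] == 0))
    # 'foo' is a placeholder for the real replacement values
    grp.loc[mask, ['age', 'moved']] = 'foo'
    return grp

df = gr_data.apply(foo)
print(df)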

yields

   unique_pid2  year  age moved
0     5904_181  1983    0     0
1     5904_181  1984    0     0
2     5904_181  1985    0     0
3     5904_181  1986    0     0
4     5904_181  1987    2     5
5     5904_181  1988    0     5
6     5904_181  1989  foo   foo
7     5904_181  1990  foo   foo
8     5904_181  1991  foo   foo
9     5904_181  1992  foo   foo
10    5904_181  1993  foo   foo
11    5904_181  2009   24     5
12    5904_181  2011   26     1
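Note that the cumsum mask only checks that a positive value came earlier in the group; it does not check that one follows. In the example group this makes no difference, because the last two rows are non-zero. A minimal sketch that checks both sides is given below. It is an assumption-laden variant, not the method above: sandwiched_zero_mask is a hypothetical helper name, the toy data is made up for illustration, and a row counts as "having data" when either age or moved is non-zero (following the wording of the question), whereas the mask above requires a positive value to have been seen in both columns.

```python
import pandas as pd

def sandwiched_zero_mask(grp):
    # Hypothetical helper, not part of the answer above.
    # A row "has data" if age or moved is non-zero.
    has_data = (grp["age"] != 0) | (grp["moved"] != 0)
    counts = has_data.astype(int)
    # True once at least one row with data has been seen from the top.
    seen_before = counts.cumsum() > 0
    # Same running count taken from the bottom up, then flipped back.
    seen_after = counts.iloc[::-1].cumsum().iloc[::-1] > 0
    # Keep only all-zero rows that have data on both sides.
    return ~has_data & seen_before & seen_after

# Made-up toy group with leading and trailing all-zero rows.
toy = pd.DataFrame({
    "age":   [0, 2, 0, 0, 24, 0],
    "moved": [0, 5, 0, 0, 5, 0],
})
mask = sandwiched_zero_mask(toy)
print(mask.tolist())  # prints [False, False, True, True, False, False]
```

Inside a per-group function such as foo, the returned mask could be used in place of the cumsum expression, so that trailing all-zero rows are left untouched as well.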
