
Python pandas: replace select values in groupby object

I have a large dataframe with individual-level data in four columns: a person ID number, the year, the person's age, and the person's moving status. I use groupby on the person ID number, stored in column unique_pid2.

import pandas as pd 

gr_data = pd.read_csv("M:/test.csv").groupby('unique_pid2')

group = gr_data.get_group('5904_181')

print(group)

Each group looks like this:

       unique_pid2  year  age  moved
798908    5904_181  1983    0      0
798909    5904_181  1984    0      0
798910    5904_181  1985    0      0
798911    5904_181  1986    0      0
798912    5904_181  1987    2      5
798913    5904_181  1988    0      5
798914    5904_181  1989    0      0
798915    5904_181  1990    0      0
798916    5904_181  1991    0      0
798917    5904_181  1992    0      0
798918    5904_181  1993    0      0
798928    5904_181  2009   24      5
798929    5904_181  2011   26      1

For each group, I want to fill in values that are equal to zero in BOTH the moved and age columns with alternate values, but ONLY if these observations are "sandwiched" between other observations with at least one non-zero value in the age and moved columns.

For example, in the above group, I want to fill in rows 798914:798918, but not 798908:798911. For the observations that have both age and moved values equal to 0, I have written a function that replaces the zeros accordingly. But I want to call this function only on the "sandwich" cases like 798914:798918, and I don't know how to access those rows.

So far, I have tried something like:

group.loc[(group["age"] == 0) & (group["moved"] == 0), ['age', 'moved']] = someFunction(group)

But this fills in the non-sandwiched observations, like the first four rows in the above group. How should I go about applying a function to fill in age and moved values equal to 0 in each group, but only for observations that are sandwiched between observations with non-zero values in either age, moved, or both?

Assuming the values in age and moved are non-negative, you could select the desired rows using cumsum:

mask = ((grp['age'].cumsum()>0) & (grp['moved'].cumsum()>0)
        & (grp['age'] == 0) & (grp['moved'] == 0))

since when the cumulative sum is greater than 0, there must have been a preceding positive value.
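To make that reasoning concrete, here is a minimal sketch on a hand-typed copy of six rows from the sample group (the frame `grp` and the helper column names are typed in for illustration, not read from the CSV). It prints the running sums next to the resulting mask:

```python
import pandas as pd

# Hand-typed excerpt of the sample group (years 1985-1990); illustration only.
grp = pd.DataFrame({
    "year":  [1985, 1986, 1987, 1988, 1989, 1990],
    "age":   [0, 0, 2, 0, 0, 0],
    "moved": [0, 0, 5, 5, 0, 0],
})

# A running sum above zero means a positive value occurred at or before this row
# (this relies on the values being non-negative).
age_seen = grp["age"].cumsum() > 0
moved_seen = grp["moved"].cumsum() > 0

# Rows that are candidates for filling: both columns are zero.
both_zero = (grp["age"] == 0) & (grp["moved"] == 0)

mask = age_seen & moved_seen & both_zero

# Show the intermediate columns side by side with the mask.
print(grp.assign(age_cumsum=grp["age"].cumsum(),
                 moved_cumsum=grp["moved"].cumsum(),
                 mask=mask))
```

The two leading rows stay unselected because both running sums are still zero there, while the 1989 and 1990 rows are selected.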

For example,

import pandas as pd

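df = pd.read_csv("M:/test.csv")
# group_keys=False keeps the original row index in the result
gr_data = df.groupby('unique_pid2', group_keys=False)

def foo(grp):
    # cast to object so the string placeholder can be stored (also makes a copy)
    grp = grp.astype({'age': object, 'moved': object})
    # both columns are zero here, and both running sums are already positive
    mask = ((grp['age'].cumsum()>0) & (grp['moved'].cumsum()>0)
            & (grp['age'] == 0) & (grp['moved'] == 0))
    # 'foo' is only a placeholder; substitute the real replacement values
    grp.loc[mask, ['age', 'moved']] = 'foo'
    return grp

df = gr_data.apply(foo)
print(df)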

yields

   unique_pid2  year  age moved
0     5904_181  1983    0     0
1     5904_181  1984    0     0
2     5904_181  1985    0     0
3     5904_181  1986    0     0
4     5904_181  1987    2     5
5     5904_181  1988    0     5
6     5904_181  1989  foo   foo
7     5904_181  1990  foo   foo
8     5904_181  1991  foo   foo
9     5904_181  1992  foo   foo
10    5904_181  1993  foo   foo
11    5904_181  2009   24     5
12    5904_181  2011   26     1
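Note that the cumsum mask only looks backwards: it would also select zero rows at the end of a group that have no non-zero row after them, and it requires both columns to have been positive earlier. If a strict "sandwich" is needed, one possible variation is to run the same cumulative-sum check in reverse as well. The sketch below assumes the "either column" reading of the question; the helper name `sandwiched_mask` and the group "hypo_1" are made up for illustration.

```python
import pandas as pd

# Hand-typed data: a shortened copy of the sample group plus a made-up
# group ("hypo_1") that ends in zero rows with nothing after them.
df = pd.DataFrame({
    "unique_pid2": ["5904_181"] * 8 + ["hypo_1"] * 4,
    "year": [1985, 1986, 1987, 1988, 1989, 1990, 2009, 2011,
             2000, 2001, 2002, 2003],
    "age": [0, 0, 2, 0, 0, 0, 24, 26, 3, 0, 0, 0],
    "moved": [0, 0, 5, 5, 0, 0, 5, 1, 1, 0, 0, 0],
})

def sandwiched_mask(grp):
    """Flag all-zero rows that have a non-zero row both before and after them."""
    both_zero = (grp["age"] == 0) & (grp["moved"] == 0)
    # 1 where at least one of the two columns is non-zero, else 0
    has_value = (~both_zero).astype(int)
    # Forward running count: a non-zero row exists at or before this row
    seen_before = has_value.cumsum() > 0
    # Same check on the reversed group: a non-zero row exists at or after this row
    seen_after = has_value[::-1].cumsum()[::-1] > 0
    return both_zero & seen_before & seen_after

# Build the mask group by group, then restore the original row order.
parts = [sandwiched_mask(grp) for _, grp in df.groupby("unique_pid2")]
mask = pd.concat(parts).reindex(df.index)

print(df[mask])
```

The resulting mask can be used like the one above, for example `df.loc[mask, ['age', 'moved']] = ...` with the real replacement values. In this data only the 1989 and 1990 rows of the first group are selected; the trailing zero rows of "hypo_1" are left alone.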
