I have a large dataframe with individual-level data in four columns: a person id number, the year, her age, and her moving status. I use groupby on the person id number, stored in column unique_pid2.
import pandas as pd
gr_data = pd.read_csv("M:/test.csv").groupby('unique_pid2')
group = gr_data.get_group('5904_181')
print(group)
Each group looks like this:
unique_pid2 year age moved
798908 5904_181 1983 0 0
798909 5904_181 1984 0 0
798910 5904_181 1985 0 0
798911 5904_181 1986 0 0
798912 5904_181 1987 2 5
798913 5904_181 1988 0 5
798914 5904_181 1989 0 0
798915 5904_181 1990 0 0
798916 5904_181 1991 0 0
798917 5904_181 1992 0 0
798918 5904_181 1993 0 0
798928 5904_181 2009 24 5
798929 5904_181 2011 26 1
For each group, I want to fill in values that are equal to zero in BOTH the moved and age columns with alternate values, but ONLY if these observations are "sandwiched" between other observations with at least one non-zero value in the age and moved columns.
For example, in the above group, I want to fill in rows 798914:798918, but not 798908:798911. For the observations that have both age and moved values equal to 0, I have written a function that replaces the zeros accordingly. But I want to call this function only on the "sandwich" cases like 798914:798918, and I don't know how to access those rows.
So far, I have tried something like:
group.loc[(group["age"] == 0) & (group["moved"] == 0), ['age', 'moved']] = someFunction(group)
But this also fills in the non-sandwiched observations, like the first four rows in the above group. How should I go about applying a function to fill in age and moved values equal to 0 in each group, but only for observations that are sandwiched between observations with non-zero values in either age, moved, or both?
Assuming the values in age and moved are non-negative, you could select the desired rows using cumsum:
mask = ((grp['age'].cumsum()>0) & (grp['moved'].cumsum()>0)
& (grp['age'] == 0) & (grp['moved'] == 0))
since when the cumulative sum is greater than 0, there must have been a preceding positive value.
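To see why the cumulative sum works as a "has a positive value appeared yet" flag, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from the original file.

```python
import pandas as pd

# Hypothetical toy column (non-negative values, as assumed above)
age = pd.Series([0, 0, 2, 0, 0, 24])

# The running total stays at 0 until the first positive value appears,
# so this is False for the leading zeros and True from then on
seen_positive = age.cumsum() > 0

# Zeros that come after at least one positive value
zero_after_positive = seen_positive & (age == 0)

print(seen_positive.tolist())
print(zero_after_positive.tolist())
```

The leading zeros are excluded because their running total is still 0, while the zeros after the first positive entry are selected.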
For example,
import pandas as pd
df = pd.read_csv("M:/test.csv")
gr_data = df.groupby('unique_pid2')
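def foo(grp):
    # True for rows where both columns are 0 and a positive value
    # has already appeared earlier in the group
    mask = ((grp['age'].cumsum()>0) & (grp['moved'].cumsum()>0)
            & (grp['age'] == 0) & (grp['moved'] == 0))
    # 'foo' is a placeholder for the real replacement values
    grp.loc[mask, ['age', 'moved']] = 'foo'
    return grp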
df = gr_data.apply(foo)
print(df)
yields
unique_pid2 year age moved
0 5904_181 1983 0 0
1 5904_181 1984 0 0
2 5904_181 1985 0 0
3 5904_181 1986 0 0
4 5904_181 1987 2 5
5 5904_181 1988 0 5
6 5904_181 1989 foo foo
7 5904_181 1990 foo foo
8 5904_181 1991 foo foo
9 5904_181 1992 foo foo
10 5904_181 1993 foo foo
11 5904_181 2009 24 5
12 5904_181 2011 26 1
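Note that the mask above only looks backwards: it selects zero rows that follow a positive value, and it needs a positive value in both columns. That is enough for this group because its last rows are non-zero. If a group could end with a run of zeros, or if a non-zero value in either column should count, one possible variation is to combine a forward and a reversed cumulative sum. The helper name sandwiched_zero_mask and the toy data below are hypothetical, and this is only a sketch under those assumptions.

```python
import pandas as pd

def sandwiched_zero_mask(grp):
    """Hypothetical helper: True where age and moved are both 0 and the
    group has a non-zero row both before and after that row."""
    # A row counts as non-zero if either column is non-zero
    nonzero = (grp['age'] != 0) | (grp['moved'] != 0)
    counts = nonzero.astype(int)
    # Has a non-zero row been seen at or before this position?
    seen_before = counts.cumsum() > 0
    # Same check from the bottom up: reverse, accumulate, reverse back
    seen_after = counts.iloc[::-1].cumsum().iloc[::-1] > 0
    return ~nonzero & seen_before & seen_after

# Made-up group with leading zeros, a sandwiched run, and a trailing zero
toy = pd.DataFrame({
    'age':   [0, 0, 2, 0, 0, 0, 24, 0],
    'moved': [0, 0, 5, 5, 0, 0, 5, 0],
})
mask = sandwiched_zero_mask(toy)
print(toy[mask])
```

Only the two all-zero rows between the non-zero observations are selected; the leading zeros and the trailing zero row are left alone. The mask could be used inside foo in place of the cumsum expression.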