
Pandas data frame with groupby: How to create indicator variable for the first and last rows in each group

Suppose I have a data frame like this:

      X
  0  10
  1  10
  2  10
  3  10
  4  20
  5  20
  6  30
  7  30
  8  30
  9  30

and I plan to use it in a df.groupby(['X']).apply(function) operation. I want to add indicator columns that mark the rows where each group starts and finishes. The new frame should look like this (I abbreviated False to F):

     X  First_X  Last_X
0  10  True     F
1  10  F        F
2  10  F        F
3  10  F        True
4  20  True     F
5  20  F        True
6  30  True     F
7  30  F        F
8  30  F        F
9  30  F        True

How would I do it?

I have the same question for the case where I group by two or more columns, for example df.groupby(['X','Y']).apply(function). For the second variable, I want to mark the first and the last row within each group created by the first variable.

     X     Y
0  10    1
1  10    1
2  10    2
3  10    2
4  20    3
5  20    4
6  30    5
7  30    5
8  30    5
9  30    6

and the resulting frame should be

    X    Y   First_X  Last_X  First_Y  Last_Y
0  10    1   True     F       True     F
1  10    1   F        F       F        True
2  10    2   F        F       True     F
3  10    2   F        True    F        True
4  20    3   True     F       True     True
5  20    4   F        True    True     True
6  30    5   True     F       True     F
7  30    5   F        F       F        F
8  30    5   F        F       F        True
9  30    6   F        True    True     True

Is using DataFrame.shift and DataFrame.merge the right way to approach the problem?

Thank you.

First question:

df=df.assign(First_X=df.X.ne(df.X.shift()),Last_X=df.X.ne(df.X.shift(-1)))
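
As a self-contained check of the shift comparison above, the sketch below rebuilds the sample frame from the question and applies the same expression. It assumes that rows with equal X are adjacent, as in the question; if the same X value could reappear later in the frame, each run would be flagged separately.

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({"X": [10, 10, 10, 10, 20, 20, 30, 30, 30, 30]})

# A row starts a group when X differs from the previous row,
# and ends a group when X differs from the next row.
# Assumption: rows with equal X are contiguous (e.g. the frame is sorted by X).
df = df.assign(
    First_X=df["X"].ne(df["X"].shift()),
    Last_X=df["X"].ne(df["X"].shift(-1)),
)

print(df)
```

The very first and very last rows are flagged because shift() fills the missing neighbour with NaN, which never compares equal to a value of X.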

Second question:

print(df3)

    X  Y First_X Last_X
0  10  1    True      F
1  10  1       F      F
2  10  2       F      F
3  10  2       F   True
4  20  3    True      F
5  20  4       F   True
6  30  5    True      F
7  30  5       F      F
8  30  5       F      F
9  30  6       F   True



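# group_keys=False keeps the original row index, so the result aligns with df3
df3=df3.assign(
    First_Y=df3.groupby(['X','Y'], group_keys=False)['Y'].apply(lambda x: x.ne(x.shift())),
    Last_Y=df3.groupby(['X','Y'], group_keys=False)['Y'].apply(lambda x: x.ne(x.shift(-1))))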



    X  Y First_X Last_X  First_Y  Last_Y
0  10  1    True      F     True   False
1  10  1       F      F    False    True
2  10  2       F      F     True   False
3  10  2       F   True    False    True
4  20  3    True      F     True    True
5  20  4       F   True     True    True
6  30  5    True      F     True   False
7  30  5       F      F    False   False
8  30  5       F      F    False    True
9  30  6       F   True     True    True
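
A possible alternative for the two-column case, not taken from the answer above, is DataFrame.duplicated, which avoids groupby entirely. The sketch below is a minimal version under that assumption; the helper name add_first_last is hypothetical and only used for illustration.

```python
import pandas as pd

# Sample frame from the second part of the question
df3 = pd.DataFrame({
    "X": [10, 10, 10, 10, 20, 20, 30, 30, 30, 30],
    "Y": [1, 1, 2, 2, 3, 4, 5, 5, 5, 6],
})

def add_first_last(frame, keys, name):
    """Hypothetical helper: flag the first and last row of each key combination."""
    out = frame.copy()
    # duplicated(keep="first") is False only at the first occurrence of a key,
    # so negating it yields True at the first row of each group.
    out[f"First_{name}"] = ~out.duplicated(subset=keys, keep="first")
    # keep="last" does the same for the last occurrence.
    out[f"Last_{name}"] = ~out.duplicated(subset=keys, keep="last")
    return out

result = add_first_last(df3, ["X"], "X")
result = add_first_last(result, ["X", "Y"], "Y")
print(result)
```

Because duplicated looks at the first and last occurrence in the whole frame, this version does not require equal keys to sit in adjacent rows.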

For the first question, inspired by the similar question here:

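df['first'] = False
df['last'] = False

def set_cols(group):
  # Work on a copy and assign through a single .iloc call;
  # chained indexing such as group['first'].iloc[0] = True may write to a temporary copy
  group = group.copy()
  group.iloc[0, group.columns.get_loc('first')] = True
  group.iloc[-1, group.columns.get_loc('last')] = True
  return group

# group_keys=False keeps the original row index in the result
df = df.groupby('X', group_keys=False).apply(set_cols)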

This gives the desired result.
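
The same flags can probably be obtained without a Python-level function per group. The sketch below is one such variant, assuming GroupBy.cumcount is acceptable; it is an alternative to the apply approach above, not a restatement of it.

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({"X": [10, 10, 10, 10, 20, 20, 30, 30, 30, 30]})

grouped = df.groupby("X")

# cumcount() numbers the rows 0, 1, 2, ... within each group,
# so position 0 is the first row of the group.
is_first = grouped.cumcount() == 0
# With ascending=False the numbering counts down and reaches 0 at the last row.
is_last = grouped.cumcount(ascending=False) == 0

df["first"] = is_first
df["last"] = is_last
print(df)
```

Both Series keep the index of the original frame, so they can be assigned back directly.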

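# Marks with 1 (NaN elsewhere) the row holding the smallest / largest Y within each X group.
# idxmin and idxmax return the first occurrence, so this matches the first / last row only
# when Y is non-decreasing within each group and its largest value occurs only once.
df.assign(
    first_ind=lambda df: pd.Series(data=1, index=df.groupby('X')['Y'].idxmin()),
    last_ind=lambda df: pd.Series(data=1, index=df.groupby('X')['Y'].idxmax()))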


 