Count first consecutive matches on a group

I am quite new to pandas and am trying to count, for each car, the length of the first run of consecutive identical colors in this DataFrame:

    car   color
0   audi  black
1   audi  black
2   audi   blue
3   audi  black
4    bmw   blue
5    bmw  green
6    bmw   blue
7    bmw   blue
8   fiat  green
9   fiat  green
10  fiat  green
11  fiat   blue

Thanks to jezrael, I have code that counts the total number of times each car's first color appears:

import pandas as pd

df = pd.DataFrame(data={
  'car': ['audi', 'audi', 'audi', 'audi', 'bmw', 'bmw', 'bmw', 'bmw', 'fiat', 'fiat', 'fiat', 'fiat'],'color': ['black', 'black', 'blue', 'black', 'blue', 'green', 'blue', 'blue', 'green', 'green', 'green', 'blue']
})

df1 = (df.groupby('car')['color']
          .transform('first')
          .eq(df['color'])
          .astype('int8')  # cast booleans to integers; Series.view is deprecated in recent pandas
          .groupby(df['car'])
          .sum()
          .reset_index(name='colour_cars'))

print(df1)

This works well for counting the total:

    car  colour_cars
0  audi            3
1   bmw            3
2  fiat            3

But it turns out that what I really need is the length of the first consecutive run only, so the result should be:

    car  colour_cars
0  audi            2
1   bmw            1
2  fiat            3

I have tried using an apply function to stop the .sum() as soon as .eq returns a False. Any help finding a way to break off the count once .eq returns its first False would be greatly appreciated.

Use:

df = (df.groupby(['car', df.color.ne(df.color.shift()).cumsum()])
        .size()
        .reset_index(level=1, drop=True)
        .reset_index(name='colour_cars')
        .drop_duplicates('car'))

print (df)
    car  colour_cars
0  audi            2
3   bmw            1
6  fiat            3

Details :

Create a helper Series that labels each run of consecutive equal values in the color column, and pass it to groupby together with car. GroupBy.size counts the rows in each run, the first DataFrame.reset_index (with level=1, drop=True) removes the index level created by the helper Series, the second reset_index converts the car index into a column, and DataFrame.drop_duplicates keeps only the first row for each car:

print (df.color.ne(df.color.shift()).cumsum())
0     1
1     1
2     2
3     3
4     4
5     5
6     6
7     6
8     7
9     7
10    7
11    8
Name: color, dtype: int32
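
The steps described above can also be run one at a time, which makes each intermediate result easier to inspect. This is only a sketch on the sample data from the question; the intermediate names changed, run_id and run_sizes are made up for illustration and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['audi'] * 4 + ['bmw'] * 4 + ['fiat'] * 4,
    'color': ['black', 'black', 'blue', 'black',
              'blue', 'green', 'blue', 'blue',
              'green', 'green', 'green', 'blue'],
})

# Step 1: True wherever the color differs from the previous row
changed = df['color'].ne(df['color'].shift())

# Step 2: a running total of the changes gives one id per consecutive run
run_id = changed.cumsum()

# Step 3: number of rows in each (car, run) pair
run_sizes = df.groupby(['car', run_id]).size()

# Step 4: drop the run id level, turn car into a column,
# and keep only the first run for each car
result = (run_sizes.reset_index(level=1, drop=True)
                   .reset_index(name='colour_cars')
                   .drop_duplicates('car'))

print(result)
```

Grouping by car as well as by the run id matters: if one car ended with the same color the next car starts with, both rows would share a run id, and the car key is what keeps them apart.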

Here is a slightly different approach:

# get group ids based on whether the car or the color changes from one row to the next
df = df.assign(group_id=(df.shift(1) != df).any(axis=1).cumsum())

# group and get len of consecutive identical pairs
df = df.join(df.groupby('group_id').size().rename('consec_len'), on='group_id')

# select first length for each car
df1 = df.groupby('car').consec_len.first()

df1
# returns
car
audi    2
bmw     1
fiat    3
Name: consec_len, dtype: int64
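
Because the group id here changes whenever either the car or the color changes, a run is not carried over from one car to the next. The sketch below checks this on a small made-up frame (not the data from the question) in which audi ends with blue and bmw starts with blue. It uses transform('size') instead of the join, which is a substitution on my part rather than the answer's own code.

```python
import pandas as pd

# Hypothetical data: audi ends with 'blue' and bmw starts with 'blue'
df = pd.DataFrame({
    'car':   ['audi', 'audi', 'bmw', 'bmw', 'bmw'],
    'color': ['black', 'blue', 'blue', 'blue', 'green'],
})

# A new group starts whenever car or color differs from the previous row
group_id = (df.shift(1) != df).any(axis=1).cumsum()

# Size of each consecutive run, broadcast back onto its rows
consec_len = df.groupby(group_id)['color'].transform('size')

# Length of the first run for each car
first_runs = consec_len.groupby(df['car']).first()

print(first_runs)
```

Here the first run for bmw has length 2, so the blue row that ends audi is not counted towards it.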

You could do:

# group by car and consecutive group of colors (compute count)
counts = df.groupby(['car', df.color.ne(df.color.shift()).cumsum()], as_index=False).count()

# fetch only the count corresponding to the first consecutive group of colors
result = counts[~counts.car.duplicated()].rename(columns={'color' : 'colour_cars'})

print(result)

Output

    car  colour_cars
0  audi            2
3   bmw            1
6  fiat            3
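
For comparison, the question's own .eq approach can be made to stop at the first False by taking a cumulative minimum within each car before summing. This is a sketch of a different technique from the answers above, not code from any of them; matches_first and in_first_run are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['audi'] * 4 + ['bmw'] * 4 + ['fiat'] * 4,
    'color': ['black', 'black', 'blue', 'black',
              'blue', 'green', 'blue', 'blue',
              'green', 'green', 'green', 'blue'],
})

# True where a row's color equals the first color of its car
matches_first = df.groupby('car')['color'].transform('first').eq(df['color'])

# Within each car, the cumulative minimum stays at 1 until the first 0,
# then stays at 0, so later matches are no longer counted
in_first_run = matches_first.astype(int).groupby(df['car']).cummin()

# Sum the remaining 1s per car
result = (in_first_run.groupby(df['car'])
                      .sum()
                      .reset_index(name='colour_cars'))

print(result)
```

For audi the match flags 1, 1, 0, 1 become 1, 1, 0, 0 after the cumulative minimum, so the sum is 2 rather than 3.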
