
Pandas - Find longest streak of string values in column together with row id

I am trying to find the longest streak of consecutive identical string values in a column, and also where that streak starts. The data I have is formatted like this:

ID Datetime Name 
0  Date1,   Harald
1  Date2,   Harald
2  Date3,   Esther
3  Date4,   Steve
4  Date5,   Esther
5  Date6,   Esther
6  Date7,   Esther

The expected output is the longest streak for each name, together with either the date or the row number where that streak starts:

Output = {
    Harald: 2, 0 or Date1
    Esther: 3, 4 or Date5
    Steve: 1, 3 or Date4
}

My solution that got closest was this:

from itertools import groupby

def getLongestStreak():
    longestStreakDict = {}
    s = df['Name']

    # index counts the groups yielded by groupby, not the rows of df
    for index, (key, group) in enumerate(groupby(s.tolist())):
        grouplength = len(list(group))
        if key not in longestStreakDict or longestStreakDict[key][0] < grouplength:
            longestStreakDict[key] = grouplength, index
    return longestStreakDict

Unfortunately, the second value it returns is the position of the group in the groupby iterator (that is, how many times the name has changed so far), not the row index. It also relies on itertools and will be slow for large datasets.

{'Harald': (2, 0), 'Esther': (3, 3), 'Steve': (1, 2)}

Does anyone know a non-iterating solution that also returns the proper row index?

We can use Series.shift + Series.ne + Series.cumsum to label groups of consecutive identical names (see Detail at the end). Then GroupBy.agg builds a dataframe with the name, the size, the first index and the first Datetime value of each group. Sort this dataframe by size with DataFrame.sort_values and use DataFrame.drop_duplicates to drop the shorter groups of each name, so that only the longest streak per name remains. Convert the numeric columns to str (you may need to convert Datetime as well if it is not str in your actual data) and join the columns with Series.str.cat. Finally, DataFrame.set_index + Series.to_dict give the dictionary.

# a new group starts whenever Name differs from the previous row
groups = df['Name'].ne(df['Name'].shift()).cumsum()
df_agg = (df.groupby(groups, sort=False)
            .agg(Name=('Name', 'first'),
                 length=('Name', 'size'),
                 idxmin=('ID', 'idxmin'),
                 Datemin=('Datetime', 'first'))
            # longest groups first, then keep one row per name
            .sort_values('length', ascending=False)
            .drop_duplicates('Name'))


df_agg['j1']=df_agg['length'].astype(str).str.cat(df_agg['idxmin'].astype(str),sep=',')
df_agg['j']=df_agg['j1'].str.cat(df_agg['Datemin'],sep=' or ')
print(df_agg)

        Name  length  idxmin Datemin   j1             j
Name                                                  
4     Esther       3       4   Date5  3,4  3,4 or Date5
1     Harald       2       0   Date1  2,0  2,0 or Date1
3      Steve       1       3   Date4  1,3  1,3 or Date4

my_dict=df_agg.set_index('Name')['j'].to_dict()
print(my_dict)

Output

{'Esther': '3,4 or Date5', 'Harald': '2,0 or Date1', 'Steve': '1,3 or Date4'}
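
If plain values are more convenient than joined strings, the same grouping can return tuples instead. This is only a minimal sketch: it assumes the sample data from the question with a default RangeIndex, and the names `longest_streaks`, `streak_id`, `start` and `date` are illustrative rather than part of the answer above.

```python
import pandas as pd

# Sample data from the question (assumed to carry a default RangeIndex).
df = pd.DataFrame({
    "Datetime": ["Date1", "Date2", "Date3", "Date4", "Date5", "Date6", "Date7"],
    "Name": ["Harald", "Harald", "Esther", "Steve", "Esther", "Esther", "Esther"],
})


def longest_streaks(frame):
    """Return {name: (streak length, first row label, first Datetime)}."""
    # A new group starts whenever Name differs from the previous row.
    groups = frame["Name"].ne(frame["Name"].shift()).cumsum().rename("streak_id")
    streaks = (
        frame.assign(start=frame.index)  # keep the row label as a column
        .groupby(groups, sort=False)
        .agg(
            Name=("Name", "first"),
            length=("Name", "size"),
            start=("start", "first"),
            date=("Datetime", "first"),
        )
        # Stable sort: for equal lengths the earlier streak stays first.
        .sort_values("length", ascending=False, kind="stable")
        .drop_duplicates("Name")
    )
    return {
        name: (int(length), start, date)
        for name, length, start, date in streaks.itertuples(index=False, name=None)
    }


print(longest_streaks(df))
```

For the sample data, Esther maps to a streak of length 3 starting at row 4 (Date5). Taking the row label from `frame.index` instead of an `ID` column is a design choice of this sketch, so it also works when `ID` is the index rather than a column.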

Detail:

print(groups)

0    1
1    1
2    2
3    3
4    4
5    4
6    4
Name: Name, dtype: int64
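
The group labels above come from three steps: `shift` moves each name down one row, `ne` marks the rows where the name differs from the one before it, and `cumsum` turns those marks into a running group number. A small sketch on the sample names, where the variable names `previous` and `is_new_streak` are only illustrative:

```python
import pandas as pd

names = pd.Series(
    ["Harald", "Harald", "Esther", "Steve", "Esther", "Esther", "Esther"],
    name="Name",
)

previous = names.shift()            # name on the row above; NaN for the first row
is_new_streak = names.ne(previous)  # True wherever a new streak starts
groups = is_new_streak.cumsum()     # running count of streak starts

# Show the intermediate steps side by side.
print(pd.DataFrame({
    "Name": names,
    "previous": previous,
    "is_new_streak": is_new_streak,
    "group": groups,
}))
```

The first row always starts a streak because comparing a name with the missing value produced by `shift` counts as "not equal".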
