I am trying to find the longest streak of string values and also where it is. The data I have is formatted like this:
ID Datetime Name
0 Date1, Harald
1 Date2, Harald
2 Date3, Esther
3 Date4, Steve
4 Date5, Esther
5 Date6, Esther
6 Date7, Esther
The expected output is the longest streak for each name, together with either the row number or the date where that streak starts:
Output = {
Harald: 2, 0 or Date1
Esther: 3, 4 or Date5
Steve: 1, 3 or Date4
}
My solution that got closest was this:
from itertools import groupby

longestStreakDict = {}

def getLongestStreak():
    s = df['Name']
    # index counts the groups yielded by groupby, not the rows of df
    for index, (key, group) in enumerate(groupby(s.tolist())):
        grouplength = len(list(group))
        if key in longestStreakDict:
            if longestStreakDict[key][0] < grouplength:
                longestStreakDict[key] = grouplength, index
        else:
            longestStreakDict[key] = grouplength, index
Unfortunately, this pairs the longest streak with the number of times the group changed in the groupby iterator rather than with the row index. It also uses itertools and will be slow for large datasets.
{'Harald': (2, 1), 'Esther': (3, 3), 'Steve': (1, 2)}
Does anyone know a non-iterating solution that also returns the proper row index?
We can use Series.shift + Series.cumsum to create a group label for each run of consecutive names (see Detail below). Then use GroupBy.agg to build a dataframe with the size of each group, along with the first index and Datetime value of each group. Sort that dataframe by size with DataFrame.sort_values and use DataFrame.drop_duplicates to drop the smaller groups that share a name, so only the longest one per name remains. Convert the columns to str (you may also need to convert Datetime if it is not str in your actual data), then join them with Series.str.cat. Finally, use DataFrame.set_index + Series.to_dict to obtain the dictionary.
# a new group label starts wherever Name differs from the previous row
groups = df['Name'].ne(df['Name'].shift()).cumsum()
df_agg = (df.groupby(groups, sort=False)
            .agg(Name=('Name', 'first'),
                 Datemin=('Datetime', 'first'),
                 length=('Name', 'size'),
                 idxmin=('ID', 'idxmin'))
            .sort_values('length', ascending=False)  # longest runs first
            .drop_duplicates('Name'))                # keep the longest run per name
df_agg['j1'] = df_agg['length'].astype(str).str.cat(df_agg['idxmin'].astype(str), sep=',')
df_agg['j'] = df_agg['j1'].str.cat(df_agg['Datemin'], sep=' or ')
print(df_agg)
        Name  length  idxmin Datemin   j1             j
Name
4     Esther       3       4   Date5  3,4  3,4 or Date5
1     Harald       2       0   Date1  2,0  2,0 or Date1
3      Steve       1       3   Date4  1,3  1,3 or Date4
my_dict=df_agg.set_index('Name')['j'].to_dict()
print(my_dict)
Output
{'Esther': '3,4 or Date5', 'Harald': '2,0 or Date1', 'Steve': '1,3 or Date4'}
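If you would rather have (length, row index) tuples, as in your own attempt, instead of joined strings, the same grouping idea can be wrapped in a small function. This is only a sketch under some assumptions: the sample frame is rebuilt from the question, the row number is taken from the default index rather than from an ID column, and the names longest_streaks, run_id and _row are hypothetical helpers, not part of pandas.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed to use the default RangeIndex).
df = pd.DataFrame({
    "Datetime": ["Date1", "Date2", "Date3", "Date4", "Date5", "Date6", "Date7"],
    "Name": ["Harald", "Harald", "Esther", "Steve", "Esther", "Esther", "Esther"],
})

def longest_streaks(frame, col="Name"):
    """Map each value in `col` to (length, first index label) of its longest run."""
    # A new run label starts wherever the value differs from the previous row.
    run_id = frame[col].ne(frame[col].shift()).cumsum().rename("run")
    runs = (
        frame.assign(_row=frame.index)  # expose the index so it can be aggregated
        .groupby(run_id, sort=False)
        .agg(name=(col, "first"), length=(col, "size"), start=("_row", "first"))
    )
    # Longest runs first; the stable sort is meant to keep the earlier run on ties.
    best = runs.sort_values("length", ascending=False, kind="stable").drop_duplicates("name")
    return {n: (int(l), s) for n, l, s in zip(best["name"], best["length"], best["start"])}

result = longest_streaks(df)
print(result)
```

To report the date instead of the row number, aggregating ("Datetime", "first") in place of ("_row", "first") should work the same way.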
Detail:
print(groups)
0 1
1 1
2 2
3 3
4 4
5 4
6 4
Name: Name, dtype: int64
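To see where those labels come from, it can help to split the expression into its two steps. The sketch below rebuilds the Name column from the question as a standalone Series; the variable names names and changed are just illustrative.

```python
import pandas as pd

names = pd.Series(
    ["Harald", "Harald", "Esther", "Steve", "Esther", "Esther", "Esther"],
    name="Name",
)

# Step 1: True wherever the name differs from the previous row.
# The first row is compared against NaN, so it always starts a new run.
changed = names.ne(names.shift())

# Step 2: the running total of those flags gives every run its own label.
groups = changed.cumsum()

print(pd.DataFrame({"Name": names, "changed": changed, "groups": groups}))
```

Rows that share a label belong to the same streak, which is why grouping by groups and taking the size gives the streak length.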