Grouping Unique Strings in a Column and Performing Function On Separate Column Values
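In my dataframe I have an "away_lineup" column that contains groupings of 5 strings, and a "play_length" column in which each row holds a duration value. I know that np.unique can detect unique string values and that np.sum adds up the values in a column, but how can I use a function like np.unique to detect each unique string and sum the "play_length" values of the rows in which that string occurs?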
away_lineup play_length
0 Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons 0:00:05
1 Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons 0:00:10
2 Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons 0:00:20
3 Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons 0:00:07
4 Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons 0:00:25
5 Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, JJ Redick 0:00:14
My desired output would look like this:
player play_length
Dario Saric 0:01:21
Robert Covington 0:01:21
Joel Embiid 0:01:21
Markelle Fultz 0:01:21
Ben Simmons 0:01:07
JJ Redick 0:00:14
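The unique names are extracted from "away_lineup" and stored in a new "player" column, and for each player the "play_length" values of the rows in which that player appears are added together.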
Use pandas.DataFrame.explode and pandas.to_timedelta:
Note: pandas.DataFrame.explode requires pandas >= 0.25.
df['away_lineup'] = df['away_lineup'].str.split(', ')
df['play_length'] = pd.to_timedelta(df['play_length'])
new_df = df.explode('away_lineup').groupby('away_lineup').sum()
print(new_df)
Output:
play_length
away_lineup
Ben Simmons 00:01:07
Dario Saric 00:01:21
JJ Redick 00:00:14
Joel Embiid 00:01:21
Markelle Fultz 00:01:21
Robert Covington 00:01:21
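The result above is indexed by away_lineup and sorted alphabetically, while the question asks for a "player" column in order of first appearance. A minimal sketch of one way to get that shape, assuming the sample data from the question (the `result` name and the use of `sort=False` are illustrative choices, not part of the original answer):

```python
import pandas as pd

# Sample data copied from the question.
five = "Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons"
last = "Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, JJ Redick"
df = pd.DataFrame({
    "away_lineup": [five] * 5 + [last],
    "play_length": ["0:00:05", "0:00:10", "0:00:20", "0:00:07", "0:00:25", "0:00:14"],
})

df["away_lineup"] = df["away_lineup"].str.split(", ")    # one list of names per row
df["play_length"] = pd.to_timedelta(df["play_length"])   # strings -> timedeltas

result = (
    df.explode("away_lineup")                      # one row per (player, play)
      .rename(columns={"away_lineup": "player"})
      .groupby("player", sort=False)["play_length"]  # sort=False keeps first-seen order
      .sum()
      .reset_index()                               # turn the index back into a column
)
print(result)
```

With `sort=False` the groups come out in the order the names are first encountered, which matches the layout requested in the question.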
If your pandas version does not support explode:
df['play_length'] = pd.to_timedelta(df['play_length'])
new_df = pd.concat((df[['play_length']],
df['away_lineup'].str.split(r',\s*', expand=True)),
axis=1)
(new_df.melt(id_vars=['play_length'],
value_vars=new_df.columns[1:],
value_name='artist')
.groupby('artist').play_length.sum()
)
Output:
artist
Ben Simmons 00:01:07
Dario Saric 00:01:21
JJ Redick 00:00:14
Joel Embiid 00:01:21
Markelle Fultz 00:01:21
Robert Covington 00:01:21
Name: play_length, dtype: timedelta64[ns]
Here is a trick using get_dummies:
# play_length must already be a timedelta column:
# df['play_length'] = pd.to_timedelta(df['play_length'])
# Split on ', ' (comma plus space) so the names carry no leading whitespace
df.away_lineup.str.get_dummies(', ').mul(df.play_length, 0).sum()
Out[372]:
Ben Simmons 00:01:07
Dario Saric 00:01:21
JJ Redick 00:00:14
Joel Embiid 00:01:21
Markelle Fultz 00:01:21
Robert Covington 00:01:21
dtype: timedelta64[ns]
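To see why this works, it can help to look at the intermediate steps on a tiny example. The sketch below uses made-up names ("A", "B", "C") and made-up variable names (`indicators`, `per_row`, `totals`). It only illustrates the idea and assumes the names are separated by a comma plus a space:

```python
import pandas as pd

# Hypothetical two-row sample, not the data from the question.
df = pd.DataFrame({
    "away_lineup": ["A, B", "A, C"],
    "play_length": pd.to_timedelta(["0:00:05", "0:00:14"]),
})

# One 0/1 column per distinct name; a 1 means the name appears in that row.
indicators = df["away_lineup"].str.get_dummies(sep=", ")

# Multiply each row of indicators by that row's duration (axis=0 aligns on the row index),
# so every cell holds either the row's play_length or a zero timedelta.
per_row = indicators.mul(df["play_length"], axis=0)

# Summing down each column gives the total time per name.
totals = per_row.sum()
print(totals)
```

The indicator matrix grows with the number of distinct names, so for very large rosters the explode approach may be lighter on memory.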
You can use explode and groupby like this:
import numpy as np
import pandas as pd
## create dummy data
arr = [("Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons", "00:00:05"),
("Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons", "00:00:10"),
("Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons", "00:00:20"),
("Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons", "00:00:07"),
("Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons", "00:00:25"),
("Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, JJ Redick", "00:00:14"),]
df = pd.DataFrame(arr, columns=["Player", "Play Time"])
df["Play Time"] = pd.to_timedelta(df["Play Time"])
## Solution
# Split on ", " (comma plus space) so the names carry no leading whitespace
df["Player"] = df["Player"].str.split(", ")
df.explode("Player").groupby("Player").sum()
Output:
Play Time
Player
Ben Simmons 00:01:07
Dario Saric 00:01:21
JJ Redick 00:00:14
Joel Embiid 00:01:21
Markelle Fultz 00:01:21
Robert Covington 00:01:21
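If the spacing around the commas is not consistent in your real data, a fixed separator can leave stray whitespace on some names, and the same player may then end up in two different groups. A minimal sketch of one defensive variant, using made-up names and irregular spacing purely for illustration (the `exploded` and `totals` names are hypothetical):

```python
import pandas as pd

# Hypothetical sample with inconsistent spacing around the commas.
df = pd.DataFrame({
    "Player": ["A,B ,  C", "A, C"],
    "Play Time": pd.to_timedelta(["00:00:05", "00:00:14"]),
})

exploded = (
    df.assign(Player=df["Player"].str.split(","))  # split on the bare comma
      .explode("Player")                           # one row per (player, play)
      .reset_index(drop=True)                      # drop the duplicated row labels
)
exploded["Player"] = exploded["Player"].str.strip()  # remove surrounding whitespace

totals = exploded.groupby("Player")["Play Time"].sum()
print(totals)
```

Stripping after the split means "C", " C" and "  C" all fall into the same group.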