Filtering Groups With Pandas
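I am trying to add a filter to groups using pandas. In the baseball data below, I want to work out the average time it takes to go from an initial "N" to a final "Y" in the inducted column. Essentially, I want to compute the length of each group that includes a "Y" in the inducted column and has more than one row. Any tips would be helpful!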
playerID yearid votedBy ballots needed votes inducted category needed_note
2860 aaronha01 1982 BBWAA 415 312 406 Y Player NaN
3743 abbotji01 2005 BBWAA 516 387 13 N Player NaN
146 adamsba01 1937 BBWAA 201 151 8 N Player NaN
259 adamsba01 1938 BBWAA 262 197 11 N Player NaN
384 adamsba01 1939 BBWAA 274 206 11 N Player NaN
497 adamsba01 1942 BBWAA 233 175 11 N Player NaN
574 adamsba01 1945 BBWAA 247 186 7 N Player NaN
2108 adamsbo03 1966 BBWAA 302 227 1 N Player NaN
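I modified the dataset so that it contains two such groups. One run from N to Y has 2 rows and the other run from N to Y has 8 rows. These counts depend on whether you include the row containing Y. If you do not, the two groups have 1 row and 7 rows respectively. It looks like you do not have a time series column, so I assume the rows are evenly spaced in time.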
In [25]:
df=pd.read_clipboard()
print(df)
playerID yearid votedBy ballots needed votes inducted category needed_note
3741 abbotji01 2005 BBWAA 516 387 13 N Player NaN
2860 aaronha01 1982 BBWAA 415 312 406 Y Player NaN
3743 abbotji01 2005 BBWAA 516 387 13 N Player NaN
146 adamsba01 1937 BBWAA 201 151 8 N Player NaN
259 adamsba01 1938 BBWAA 262 197 11 N Player NaN
384 adamsba01 1939 BBWAA 274 206 11 N Player NaN
497 adamsba01 1942 BBWAA 233 175 11 N Player NaN
574 adamsba01 1945 BBWAA 247 186 7 N Player NaN
2108 adamsbo03 1966 BBWAA 302 227 1 N Player NaN
2861 aaronha01 1982 BBWAA 415 312 406 Y Player NaN
In [26]:
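# Flag the rows where inducted is 'Y'
df['isY']=(df.inducted=='Y')
# Shift the running count of 'Y' rows down by one row so that each 'Y' row
# closes its own group (assumes numpy has been imported as np)
df['isY']=np.hstack((0,df['isY'].cumsum().values[:-1])).T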
In [27]:
print(df.groupby('isY').count())
playerID yearid votedBy ballots needed votes inducted category needed_note isY
0 2 2 2 2 2 2 2 2 0 2
1 8 8 8 8 8 8 8 8 0 8
[2 rows x 10 columns]
Assuming you do not count the Y row, the mean can be computed as follows:
# size() counts the rows in each group; subtract 1 to leave out the closing Y row
df2=df.groupby('isY').size()-1
df2[df2!=1].mean()
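For readers who want to try the run-labelling idea without the clipboard data, here is a minimal self-contained sketch. The tiny `inducted` column is made up for illustration, and names such as `run_id` and `n_rows_before_y` are hypothetical. Like the answer above, it ignores `playerID` and simply treats consecutive rows as one run until a Y appears.

```python
import pandas as pd

# Hypothetical data: two runs that end in 'Y', plus a trailing 'N' that never does
df = pd.DataFrame({"inducted": ["N", "Y", "N", "N", "N", "Y", "N"]})

is_y = df["inducted"].eq("Y")

# Running count of 'Y' rows, shifted down one row so each 'Y' closes its own run
run_id = is_y.cumsum().shift(fill_value=0)

# Number of rows in each run, and whether the run actually contains a 'Y'
run_sizes = df.groupby(run_id).size()
ends_in_y = is_y.groupby(run_id).any()

# Drop runs that never reach 'Y', then leave the 'Y' row itself out of the count
n_rows_before_y = run_sizes[ends_in_y] - 1
mean_wait = n_rows_before_y.mean()
print(mean_wait)
```

Whether the Y row should be counted is the same judgment call discussed above; removing the `- 1` includes it.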
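I mocked up my own data to make your problem easy to test. I create a subset called df_inducted that holds the rows of players who were eventually inducted, and then, by using the isin() function, we make sure that only those players are considered in the analysis. Then I find the minimum and maximum of their dates and average the differences.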
import pandas as pd
df = pd.DataFrame({'player':['Nate','Will','Nate','Will'],
'inducted': ['Y','Y','N','N'],
'date':[2014,2000,2011,1999]})
df_inducted = df[df.inducted=='Y']
df_subset = df[df.player.isin(df_inducted.player)]
maxs = df_subset.groupby('player')['date'].max()
mins = df_subset.groupby('player')['date'].min()
maxs = pd.DataFrame(maxs)
maxs.columns = ['max_date']
mins = pd.DataFrame(mins)
mins.columns = ['min_date']
min_and_max = maxs.join(mins)
final = min_and_max['max_date'] - min_and_max['min_date']
print("average time:", final.mean())
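The same min/max logic can probably be written more compactly with a single `agg` call. This is only a sketch on the mock data from this answer; `inducted_players`, `span` and `years_to_induction` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'player': ['Nate', 'Will', 'Nate', 'Will'],
                   'inducted': ['Y', 'Y', 'N', 'N'],
                   'date': [2014, 2000, 2011, 1999]})

# Players that have at least one 'Y' row
inducted_players = df.loc[df['inducted'] == 'Y', 'player']
df_subset = df[df['player'].isin(inducted_players)]

# First and last date per player in one pass
span = df_subset.groupby('player')['date'].agg(['min', 'max'])
years_to_induction = span['max'] - span['min']

average_time = years_to_induction.mean()
print("average time:", average_time)
```

Note that this assumes the latest date for an inducted player is the induction itself, which holds for the mock data but is worth checking on real ballots.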
The filter method of the DataFrameGroupBy class operates on each subframe in the group. See help(pd.core.groupby.DataFrameGroupBy.filter). The data is:
print(df)
inducted playerID
0 Y a
1 N a
2 N a
3 Y b
4 N b
5 N c
6 N c
7 N c
Example code:
import pandas as pd
g = df.groupby('playerID')
madeit = g.filter(
lambda subframe:
'Y' in set(subframe.inducted)).groupby('playerID')
# The filter removed player 'c' who never has inducted == 'Y'
print(madeit.head())
inducted playerID
playerID
a 0 Y a
1 N a
2 N a
b 3 Y b
4 N b
# The 'aggregate' function applies a function to each subframe
print(madeit.aggregate(len))
inducted
playerID
a 3
b 2
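Since the snippet above does not show how `df` was built, here is a self-contained sketch that reconstructs the same toy frame and also applies the "more than one row" condition from the question. The names `group_lengths` and `mean_length` are introduced here for illustration, and `size()` is used in place of `aggregate(len)` as one reasonable way to count rows per group.

```python
import pandas as pd

# Toy data matching the frame printed in this answer
df = pd.DataFrame({
    "inducted": ["Y", "N", "N", "Y", "N", "N", "N", "N"],
    "playerID": ["a", "a", "a", "b", "b", "c", "c", "c"],
})

# Keep only players with at least one 'Y' row and more than one row overall
madeit = df.groupby("playerID").filter(
    lambda sub: bool((sub["inducted"] == "Y").any()) and len(sub) > 1
)

# Rows per remaining player, and the average group length
group_lengths = madeit.groupby("playerID").size()
mean_length = group_lengths.mean()
print(group_lengths)
print(mean_length)
```

Player 'c' is dropped because no row has inducted == 'Y'. A player inducted on a single ballot would be dropped by the `len(sub) > 1` check.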