Get number of maximum values per column in pandas
I have the following dataframe with time series data per day:
time-orig 00:15:00 00:30:00 00:45:00 01:00:00
date
2010-01-04 1164.3 1163.5 1162.8 1161.8
2010-01-05 1186.3 1185.8 1185.6 1185.0
2010-01-06 1181.5 1181.5 1182.7 1182.3
2010-01-07 1202.1 1201.9 1201.7 1200.8
Now I want to get the number of maximum values per column, like this:
'00:15:00' : 3
'00:30:00' : 0
'00:45:00' : 1
'01:00:00' : 0
(i.e., the column '00:15:00' has 3 maxima, looking at the maximum per row.)
I know I could transpose the dataframe, loop over the columns, and use idxmax(), but is there a vectorized or otherwise better way of doing this?
One approach would be to use np.argmax on the underlying array data and then do a binned count of the max indices with np.bincount:
np.bincount(df.iloc[:,1:].values.argmax(1), minlength=df.shape[1]-1)
Sample run:
In [141]: df
Out[141]:
time-orig 00:15:00 00:30:00 00:45:00 01:00:00
0 2010-01-04 1164.3 1163.5 1162.8 1161.8
1 2010-01-05 1186.3 1185.8 1185.6 1185.0
2 2010-01-06 1181.5 1181.5 1182.7 1182.3
3 2010-01-07 1202.1 1201.9 1201.7 1200.8
In [142]: c = np.bincount(df.iloc[:,1:].values.argmax(1), minlength=df.shape[1]-1)
In [143]: c
Out[143]: array([3, 0, 1, 0])
In [144]: np.c_[df.columns[1:], c]
Out[144]:
array([['00:15:00', 3],
['00:30:00', 0],
['00:45:00', 1],
['01:00:00', 0]], dtype=object)
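For anyone who wants to reproduce the sample run outside an IPython session, here is a minimal self-contained sketch. It assumes, as in the run above, that the date sits in an ordinary first column rather than in the index; the names `values`, `winners` and `result` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the date is a regular first column here.
df = pd.DataFrame({
    "date": ["2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07"],
    "00:15:00": [1164.3, 1186.3, 1181.5, 1202.1],
    "00:30:00": [1163.5, 1185.8, 1181.5, 1201.9],
    "00:45:00": [1162.8, 1185.6, 1182.7, 1201.7],
    "01:00:00": [1161.8, 1185.0, 1182.3, 1200.8],
})

values = df.iloc[:, 1:].to_numpy()   # skip the date column
winners = values.argmax(axis=1)      # position of each row's maximum
# minlength keeps a zero entry for columns that never hold a maximum
counts = np.bincount(winners, minlength=values.shape[1])

result = dict(zip(df.columns[1:], counts.tolist()))
print(result)
# {'00:15:00': 3, '00:30:00': 0, '00:45:00': 1, '01:00:00': 0}
```

Without `minlength`, `np.bincount` would stop at the largest index it sees, so trailing columns with no maxima would be missing from the result.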
The assumption made here is that date is the index. You can use df.idxmax along the rows, followed by value_counts on the resulting Series:
print(df)
time-orig 00:15:00 00:30:00 00:45:00 01:00:00
date
2010-01-04 1164.3 1163.5 1162.8 1161.8
2010-01-05 1186.3 1185.8 1185.6 1185.0
2010-01-06 1181.5 1181.5 1182.7 1182.3
2010-01-07 1202.1 1201.9 1201.7 1200.8
s = df.idxmax(1).value_counts().reindex(df.columns, fill_value=0)
print(s)
time-orig
00:15:00 3
00:30:00 0
00:45:00 1
01:00:00 0
dtype: int64
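A self-contained version of the same idea, in case it helps to run it directly. This is a sketch that rebuilds the question's data with date as the index, which is the assumption stated above.

```python
import pandas as pd

# Sample data from the question, with the date as the index.
df = pd.DataFrame(
    {
        "00:15:00": [1164.3, 1186.3, 1181.5, 1202.1],
        "00:30:00": [1163.5, 1185.8, 1181.5, 1201.9],
        "00:45:00": [1162.8, 1185.6, 1182.7, 1201.7],
        "01:00:00": [1161.8, 1185.0, 1182.3, 1200.8],
    },
    index=pd.Index(
        ["2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07"], name="date"
    ),
)

# idxmax(axis=1) gives the column label of each row's maximum;
# value_counts tallies those labels; reindex restores the original
# column order and fills in 0 for columns that never win.
s = df.idxmax(axis=1).value_counts().reindex(df.columns, fill_value=0)
print(s)
```

The `reindex` step matters because `value_counts` only reports labels that actually occur and sorts them by frequency, so '00:30:00' and '01:00:00' would otherwise be dropped.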
Divakar's solution is quite fast if you want a NumPy array. For your exact data, where date is the index and there is no leading column to skip, a slight modification to his answer is needed:
val = np.bincount(df.values.argmax(1), minlength=df.shape[1])
s = pd.Series(val, df.columns)
print(s)
time-orig
00:15:00 3
00:30:00 0
00:45:00 1
01:00:00 0
dtype: int64
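One thing to keep in mind: both argmax and idxmax report only the first maximum in a row, so a tied row is credited to the leftmost tied column. The sample data has no tied row maxima, so it makes no difference there. If ties should count for every column that reaches the row maximum, a comparison against the row-wise max is one possible alternative. The small frame below is made-up data chosen to contain ties, not data from the question.

```python
import pandas as pd

# Hypothetical data with a tie for the maximum in every row.
df = pd.DataFrame(
    {"a": [1.0, 5.0, 2.0], "b": [1.0, 3.0, 7.0], "c": [0.0, 5.0, 7.0]}
)

row_max = df.max(axis=1)          # maximum of each row
is_max = df.eq(row_max, axis=0)   # True wherever a cell equals its row maximum
tie_counts = is_max.sum()         # per column: rows where it reaches the maximum

# For comparison: the idxmax approach credits only the first tied column.
first_counts = df.idxmax(axis=1).value_counts().reindex(df.columns, fill_value=0)

print(tie_counts.to_dict())    # {'a': 2, 'b': 2, 'c': 2}
print(first_counts.to_dict())  # {'a': 2, 'b': 1, 'c': 0}
```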