How to count non-NaN values across columns in a pandas DataFrame?
My data looks like this:
              Close  a    b    c    d    e      Time
2015-12-03  2051.25  5    4    3    1    1  05:00:00
2015-12-04  2088.25  5    4    3    1  NaN  06:00:00
2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00
2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00
2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00
I need to count the non-NaN values "horizontally", across columns a through e, for each row. The result should look like this:
df['Count'] = .....
df
              Close  a    b    c    d    e      Time  Count
2015-12-03  2051.25  5    4    3    1    1  05:00:00      5
2015-12-04  2088.25  5    4    3    1  NaN  06:00:00      4
2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00      3
2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00      2
2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00      1
Thanks.
You can sub-select the columns from your df and call count, passing axis=1:
In [24]:
df['count'] = df[list('abcde')].count(axis=1)
df
Out[24]:
              Close  a    b    c    d    e      Time  count
2015-12-03  2051.25  5    4    3    1    1  05:00:00      5
2015-12-04  2088.25  5    4    3    1  NaN  06:00:00      4
2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00      3
2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00      2
2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00      1
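For anyone who wants to reproduce this outside IPython, here is a minimal self-contained sketch. The frame is rebuilt by hand from the sample in the question; storing Time as plain strings is an assumption, since the question does not show the dtypes.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; Time kept as plain strings (assumption).
df = pd.DataFrame(
    {
        "Close": [2051.25, 2088.25, 2081.50, 2058.25, 2042.25],
        "a": [5, 5, 5, 5, 5],
        "b": [4, 4, 4, 4, np.nan],
        "c": [3, 3, 3, np.nan, np.nan],
        "d": [1, 1, np.nan, np.nan, np.nan],
        "e": [1, np.nan, np.nan, np.nan, np.nan],
        "Time": ["05:00:00", "06:00:00", "07:00:00", "08:00:00", "09:00:00"],
    },
    index=pd.to_datetime(
        ["2015-12-03", "2015-12-04", "2015-12-07", "2015-12-08", "2015-12-09"]
    ),
)

# count() skips NaN; axis=1 counts across the selected columns of each row.
df["count"] = df[list("abcde")].count(axis=1)
print(df["count"].tolist())  # [5, 4, 3, 2, 1]
```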
Timings
In [25]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
100 loops, best of 3: 3.28 ms per loop
100 loops, best of 3: 2.76 ms per loop
100 loops, best of 3: 2.98 ms per loop
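apply is the slowest, which is not a surprise. The drop version is marginally faster, but semantically I prefer just passing the list of cols of interest and calling count, for readability.
Hmm, I keep getting varying timings now: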
In [27]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
100 loops, best of 3: 3.33 ms per loop
100 loops, best of 3: 2.7 ms per loop
100 loops, best of 3: 2.7 ms per loop
100 loops, best of 3: 2.57 ms per loop
More timings
In [160]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
%timeit df[list('abcde')].notnull().sum(axis=1)
1000 loops, best of 3: 1.4 ms per loop
1000 loops, best of 3: 1.14 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.05 ms per loop
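So it looks like testing notnull and summing (as notnull will produce a boolean mask) is slightly quicker on this dataset.
On a 50k-row df the last method is slightly quicker: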
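To make the notnull approach concrete: notnull() returns a boolean frame of the same shape, and sum(axis=1) treats True as 1, so the row sums are the non-NaN counts. A small sketch on a made-up frame (the names demo, x and y are only for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame, purely illustrative.
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

mask = demo.notnull()        # True where a value is present
per_row = mask.sum(axis=1)   # True counts as 1, False as 0

print(per_row.tolist())      # [1, 0, 2]

# Agrees with count(axis=1) row by row.
same = bool((per_row == demo.count(axis=1)).all())
print(same)                  # True
```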
In [172]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
%timeit df[list('abcde')].notnull().sum(axis=1)
1 loops, best of 3: 5.83 s per loop
100 loops, best of 3: 6.15 ms per loop
100 loops, best of 3: 6.49 ms per loop
100 loops, best of 3: 6.04 ms per loop
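The 50k-row frame used for these timings is not shown, so the sketch below builds a hypothetical one (random values with roughly 30% NaN, which is an assumption) and times two of the vectorised variants with the standard timeit module instead of %timeit. The row-wise apply version is left out because it takes seconds per call at this size. Numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical 50k-row frame: random values, roughly 30% NaN (assumption).
rng = np.random.default_rng(0)
values = rng.random((50_000, 5))
values[rng.random(values.shape) < 0.3] = np.nan
big = pd.DataFrame(values, columns=list("abcde"))

candidates = {
    "count": lambda: big.count(axis=1),
    "notnull + sum": lambda: big.notnull().sum(axis=1),
}

for name, func in candidates.items():
    # Best of 3 repeats, 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{name}: {best * 1e3:.2f} ms per call")
```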
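Either include a list of the desired columns, or simply drop the two columns that you don't want to include in the count, and then count along axis=1 (see docs):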
df['Count'] = df.drop(['Close', 'Time'], axis=1).count(axis=1)
     Close  a    b    c    d    e      Time  Count
0  2051.25  5    4    3    1    1  05:00:00      5
1  2088.25  5    4    3    1  NaN  06:00:00      4
2  2081.50  5    4    3  NaN  NaN  07:00:00      3
3  2058.25  5    4    3  NaN  NaN  08:00:00      3
4  2042.25  5    4  NaN  NaN  NaN  09:00:00      2
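When the columns to count sit next to each other, as a through e do here, a label slice is another way to select them. This is not taken from the answers in this thread, just a sketch on a small stand-in frame; note that .loc label slices include both endpoints.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with the same column layout as the question.
df = pd.DataFrame(
    {
        "Close": [2051.25, 2088.25],
        "a": [5, 5],
        "b": [4, np.nan],
        "c": [3, np.nan],
        "d": [1, np.nan],
        "e": [np.nan, np.nan],
        "Time": ["05:00:00", "06:00:00"],
    }
)

# Label slices with .loc are inclusive, so this covers a, b, c, d and e.
df["Count"] = df.loc[:, "a":"e"].count(axis=1)
print(df["Count"].tolist())  # [4, 1]
```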
df['Count'] = df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
In [1254]: df
Out[1254]:
              Close  a    b    c    d    e      Time  Count
2015-12-03  2051.25  5    4    3    1    1  05:00:00      5
2015-12-04  2088.25  5    4    3    1  NaN  06:00:00      4
2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00      3
2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00      2
2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00      1