I can't see why value_counts is giving me the wrong answer. Here is a small example:
In [81]: d=pd.DataFrame([[0,0],[1,100],[0,100],[2,0],[3,100],[4,100],[4,100],[4,100],[1,100],[3,100]],columns=['key','score'])
In [82]: d
Out[82]:
   key  score
0    0      0
1    1    100
2    0    100
3    2      0
4    3    100
5    4    100
6    4    100
7    4    100
8    1    100
9    3    100
In [83]: g=d.groupby('key')['score']
In [84]: g.value_counts(bins=[0, 20, 40, 60, 80, 100])
Out[84]:
key  score
0    (-0.001, 20.0]    1
     (20.0, 40.0]      1
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
1    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
2    (-0.001, 20.0]    1
     (20.0, 40.0]      0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
3    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
4    (20.0, 40.0]      3
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
Name: score, dtype: int64
The only values that occur in these data are 0 and 100. But value_counts tells me the range (20.0,40.0] has the most values and (80.0,100.0] has none.
Of course my real data has more values, different keys, etc. but this illustrates the problem I am seeing.
Why?
Here is another way to do it that keeps each bin label aligned with its count:
d.groupby('key')['score'].apply(pd.Series.value_counts, bins=[0,20,40,60,80,100])
Output:
key
0    (80.0, 100.0]     1
     (-0.001, 20.0]    1
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
1    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
2    (-0.001, 20.0]    1
     (80.0, 100.0]     0
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
3    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
4    (80.0, 100.0]     3
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
Name: score, dtype: int64
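If the bins should also come out in ascending order within each key, one option is to pass sort=False through to Series.value_counts. The following is only a sketch: it rebuilds the example frame from the question, the names bins and per_key are introduced here for illustration, and it assumes that value_counts leaves binned results in bin order when sorting by count is turned off.

```python
import pandas as pd

# Example frame from the question.
d = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)
bins = [0, 20, 40, 60, 80, 100]  # illustrative name, same edges as above

# Call the plain Series.value_counts once per group, so the labels and the
# counts are computed together; sort=False skips the sort by count.
per_key = d.groupby("key")["score"].apply(
    lambda s: s.value_counts(bins=bins, sort=False)
)
print(per_key)
```

Because each group is binned on its own, this is likely slower than the built-in grouped value_counts on large data, but the counts stay attached to the right intervals.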
Interesting, this may be a bug in index alignment. A workaround is to call groupby().value_counts() on the result of pd.cut:
(pd.cut(d.score, bins=[0, 20, 40, 60, 80, 100],
        include_lowest=True)
   .groupby(d['key'])
   .value_counts()
)
Output:
key  score
0    (-0.001, 20.0]    1
     (80.0, 100.0]     1
1    (80.0, 100.0]     2
2    (-0.001, 20.0]    1
3    (80.0, 100.0]     2
4    (80.0, 100.0]     3
Name: score, dtype: int64
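The output above lists only the bins that actually occur. If a full key-by-bin table with explicit zeros is wanted, the same pd.cut idea can be combined with a two-key groupby. This is a hedged sketch rather than the answer's own method: the names binned and table are made up for illustration, and it assumes a pandas version whose groupby accepts the observed argument.

```python
import pandas as pd

# Example frame from the question.
d = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)

# Bin the scores first; the result is a categorical Series of intervals.
binned = pd.cut(d["score"], bins=[0, 20, 40, 60, 80, 100],
                include_lowest=True)

# Group by key and bin together. observed=False asks pandas to keep bins
# that never occur, so they show up as zero counts.
table = (
    d.groupby(["key", binned], observed=False)
     .size()
     .unstack(fill_value=0)
)
print(table)
```

Each row of table is one key and each column is one bin, which makes it easy to check that only the lowest and highest bins are populated for this data.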