Pandas: Counting unique values in a dataframe

We have a DataFrame that looks like this:

> df.loc[:2, :10]
    0   1   2   3   4   5   6   7   8   9   10
0   NaN NaN NaN NaN  6   5  NaN NaN  4  NaN  5
1   NaN NaN NaN NaN  8  NaN NaN  7  NaN NaN  5
2   NaN NaN NaN NaN NaN  1  NaN NaN NaN NaN NaN

We simply want the counts of all unique values in the DataFrame. A simple solution is:

df.stack().value_counts() 

However:

1. It looks like stack returns a copy, not a view, which is prohibitively expensive in memory in this case. Is this correct?
2. I want to group the DataFrame by rows and then get a separate histogram for each group. If we ignore the memory issues with stack and use it for now, how does one do the grouping correctly?
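On the memory concern in point 1, one option is to count in row chunks so that only a small slice of the frame is flattened at any time. This is only a rough sketch, assuming purely numeric data; the helper name `count_values_chunked` and the `chunk_rows` parameter are made up for illustration and are not part of pandas.

```python
from collections import Counter

import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])


def count_values_chunked(frame, chunk_rows=1000):
    """Hypothetical helper: count non-NaN values a few rows at a time."""
    totals = Counter()
    for start in range(0, len(frame), chunk_rows):
        # Flatten only this slice of rows, not the whole frame.
        block = frame.iloc[start:start + chunk_rows].to_numpy().ravel()
        block = block[~pd.isna(block)]  # drop NaN before counting
        values, counts = np.unique(block, return_counts=True)
        # Counter.update with a mapping adds the counts to the running totals.
        totals.update(dict(zip(values.tolist(), counts.tolist())))
    return pd.Series(totals).sort_index()


result = count_values_chunked(d, chunk_rows=2)
print(result)
```

The peak extra memory is then roughly one chunk plus the table of distinct values, at the cost of a Python-level loop over chunks.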

import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

len(d.stack())  # 14
d.stack().groupby(np.arange(4))
AssertionError: Grouper and axis must be same length

The stacked DataFrame has a MultiIndex whose length is less than n_rows * n_columns, because the NaNs are removed.

0  1    1
   3    2
   4    3
1  0    1
   1    1
   2    1
   3    1
   4    3
    ....

This means we don't easily know how to build our grouping. It would be much better to operate on just the first level, but then I'm stuck on how to apply the grouping I actually want.

d.stack().groupby(level=0).groupby(list('aabb'))
KeyError: 'a'
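One way to apply row-group labels to the stacked Series is to map the first index level (the original row label) to its group and pass the result to a single groupby. A minimal sketch, assuming the `d` defined above and the 'aabb' labels from the question; `row_to_group` and `group_keys` are names introduced here for illustration. The explicit dropna() is there in case the installed pandas keeps NaN when stacking.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

# Stack, then drop NaN explicitly so the result does not depend on
# whether this pandas version drops NaN while stacking.
s = d.stack().dropna()

# One label per original row: rows 0 and 1 -> 'a', rows 2 and 3 -> 'b'.
row_to_group = pd.Series(list("aabb"), index=d.index)

# Translate each stacked entry's row label into its group label.
group_keys = s.index.get_level_values(0).map(row_to_group).to_numpy()

# One histogram per group, indexed by (group, value).
hist = s.groupby(group_keys).value_counts()
print(hist)
```

The grouper now has exactly one entry per stacked value, so the length mismatch from the earlier attempt does not arise.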

Edit: A solution that doesn't use stacking:

f = lambda x: pd.value_counts(x.values.ravel())
d.groupby(list('aabb')).apply(f)
a  1    4
   3    2
   2    1
b  2    4
   3    2
   1    1
dtype: int64

It looks clunky, though. If there's a better option, I'm happy to hear it.

Edit: Dan's comment revealed I had a typo, though correcting that still doesn't get us to the finish line.

I think you are doing a row/column-wise operation, so you can use apply:

In [11]: d.apply(pd.Series.value_counts, axis=1).fillna(0)
Out[11]: 
   1  2  3
0  1  1  1
1  4  0  1
2  1  1  1
3  0  4  1

Note: There is a value_counts DataFrame method in the works for 0.14, which will make this more efficient and more concise.

It's worth noting that the pandas value_counts function also works on a numpy array, so you can pass it the values of the DataFrame (as a 1-d array view using np.ravel):

In [21]: pd.value_counts(d.values.ravel())
Out[21]: 
2    6
1    6
3    4
dtype: int64
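As far as I know, recent pandas releases (I believe from the 2.1 series on) emit a deprecation warning for the top-level pd.value_counts function. The same idea can be written with the Series method instead. This sketch uses the `d` as printed in the question, which has 14 non-NaN values, so its counts will not necessarily match the output shown above.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

# Flatten all cells into one 1-d array, then count with the Series method.
flat = d.to_numpy().ravel()
counts = pd.Series(flat).value_counts()  # NaN is dropped by default
print(counts)
```

Passing dropna=False to value_counts would also report how many cells are NaN.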

Also, you were pretty close to getting this correct, but you'd need to stack and unstack:

In [22]: d.stack().groupby(level=0).apply(pd.Series.value_counts).unstack().fillna(0)
Out[22]: 
   1  2  3
0  1  1  1
1  4  0  1
2  1  1  1
3  0  4  1
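Once there is a per-row table like this, the per-group histograms from the question can be obtained by summing its rows within each group. A sketch under the assumption that the groups are the 'aabb' labels from the question and that `d` is the frame as printed there; `per_row` and `per_group` are illustrative names.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

# One row of counts per original row; values that are absent become 0.
per_row = d.apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)

# Sum the per-row counts within each group of rows.
per_group = per_row.groupby(list("aabb")).sum()
print(per_group)
```

This keeps the result as a wide table (one column per distinct value), which is convenient for plotting one histogram per group.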

This error seems somewhat self-explanatory (4 != 16):

len(d.stack())  # 16
d.stack().groupby(np.arange(4))
AssertionError: Grouper and axis must be same length

Perhaps you wanted to pass:

In [23]: np.repeat(np.arange(4), 4)
Out[23]: array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
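A fixed repeat count only lines up when every row contributes the same number of values. Since the NaNs are dropped, a grouper that always matches can be built by repeating each row number by that row's non-NaN count. A small sketch, assuming the `d` as printed in the question (14 non-NaN values); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

# Drop NaN explicitly in case this pandas version keeps it when stacking.
s = d.stack().dropna()

# Number of non-NaN cells in each row.
non_nan_per_row = d.notna().sum(axis=1).to_numpy()

# Repeat each row position as many times as that row has values.
grouper = np.repeat(np.arange(len(d)), non_nan_per_row)

per_row = s.groupby(grouper).value_counts()
print(per_row)
```

For a default integer index this grouper is the same as the first level of the stacked MultiIndex, so groupby(level=0) is the shorter spelling.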

Not enough rep to comment, but Andy's answer:

pd.value_counts(d.values.ravel()) 

is what I have used personally, and it seems to me to be by far the most versatile and easily readable solution. Another advantage is that it is easy to use a subset of the columns:

pd.value_counts(d[[1,3,4,6,7]].values.ravel()) 

or

pd.value_counts(d[["col_title1","col_title2"]].values.ravel()) 

Is there any disadvantage to this approach, or any particular reason you want to use stack and groupby?
