
How to get value counts for multiple columns at once in Pandas DataFrame?

Given a Pandas DataFrame that has multiple columns with categorical values (0 or 1), is it possible to conveniently get the value counts for every column at the same time?

For example, suppose I generate a DataFrame as follows:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 4)), columns=list('abcd'))

I can get a DataFrame like this:

   a  b  c  d
0  0  1  1  0
1  1  1  1  1
2  1  1  1  0
3  0  1  0  0
4  0  0  0  1
5  0  1  1  0
6  0  1  1  1
7  1  0  1  0
8  1  0  1  1
9  0  1  1  0

How do I conveniently get the value counts for every column and obtain the following?

   a  b  c  d
0  6  3  2  6
1  4  7  8  4

My current solution is:

pieces = []
for col in df.columns:
    tmp_series = df[col].value_counts()
    tmp_series.name = col
    pieces.append(tmp_series)
df_value_counts = pd.concat(pieces, axis=1)

But there must be a simpler way, like stacking, pivoting, or groupby?

Just call apply and pass pd.Series.value_counts:

In [212]:
df = pd.DataFrame(np.random.randint(0, 2, (10, 4)), columns=list('abcd'))
df.apply(pd.Series.value_counts)
Out[212]:
   a  b  c  d
0  4  6  4  3
1  6  4  6  7
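One caveat worth noting: apply only returns a clean integer frame when every column contains the same set of values. If a value is missing from some column, that cell becomes NaN and the column is upcast to float; fillna plus astype restores integers. A minimal sketch with made-up data:

```python
import pandas as pd

# 'b' never contains the value 0, so the raw result has a NaN there
df = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 1]})
counts = df.apply(pd.Series.value_counts).fillna(0).astype(int)
print(counts)
```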

There is actually a fairly interesting and advanced way of doing this with crosstab and melt.

df = pd.DataFrame({'a': ['table', 'chair', 'chair', 'lamp', 'bed'],
                   'b': ['lamp', 'candle', 'chair', 'lamp', 'bed'],
                   'c': ['mirror', 'mirror', 'mirror', 'mirror', 'mirror']})

df

       a       b       c
0  table    lamp  mirror
1  chair  candle  mirror
2  chair   chair  mirror
3   lamp    lamp  mirror
4    bed     bed  mirror

We can first melt the DataFrame:

df1 = df.melt(var_name='columns', value_name='index')
df1

   columns   index
0        a   table
1        a   chair
2        a   chair
3        a    lamp
4        a     bed
5        b    lamp
6        b  candle
7        b   chair
8        b    lamp
9        b     bed
10       c  mirror
11       c  mirror
12       c  mirror
13       c  mirror
14       c  mirror

And then use the crosstab function to count the values for each column. This preserves the data type as ints, which wouldn't be the case for the currently selected answer:

pd.crosstab(index=df1['index'], columns=df1['columns'])

columns  a  b  c
index           
bed      1  1  0
candle   0  1  0
chair    2  1  0
lamp     1  2  0
mirror   0  0  5
table    1  0  0

Or in one line, which expands the column names to parameter names with ** (this is advanced):

pd.crosstab(**df.melt(var_name='columns', value_name='index'))
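To see why the ** trick works: melt was told to name its two output columns 'index' and 'columns', which are exactly the keyword arguments crosstab expects, so unpacking the melted frame passes each column as the matching parameter. A small sketch (the data here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': ['y', 'y', 'y']})

melted = df.melt(var_name='columns', value_name='index')
# ** unpacks the frame as a mapping of column name -> column, so this is
# equivalent to pd.crosstab(index=melted['index'], columns=melted['columns'])
ct = pd.crosstab(**melted)
print(ct)
```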

Also, value_counts is now a top-level function, so you can simplify the currently selected answer to the following (note that pd.value_counts was later deprecated in pandas 2.1, so on recent versions prefer df.apply(pd.Series.value_counts)):

df.apply(pd.value_counts)

To get the counts only for specific columns:

df[['a', 'b']].apply(pd.Series.value_counts)

where df is the name of your dataframe, and 'a' and 'b' are the columns for which you want to count the values.

The solution that selects all categorical columns and builds a dataframe with all value counts at once:

df = pd.DataFrame({
    'fruits': ['apple', 'mango', 'apple', 'mango', 'mango', 'pear', 'mango'],
    'vegetables': ['cucumber', 'eggplant', 'tomato', 'tomato', 'tomato', 'tomato', 'pumpkin'],
    'sauces': ['chili', 'chili', 'ketchup', 'ketchup', 'chili', '1000 islands', 'chili']})

cat_cols = df.select_dtypes(include=object).columns.tolist()
(df[cat_cols]
 .melt(var_name='column', value_name='value')
 .value_counts()
 .rename('counts')  # name the result explicitly; works across pandas versions
 .to_frame()
 .sort_values(by=['column', 'counts']))

                            counts
column      value   
fruits      pear            1
            apple           2
            mango           4
sauces      1000 islands    1
            ketchup         2
            chili           4
vegetables  pumpkin         1
            eggplant        1
            cucumber        1
            tomato          4
            

You can also try this code (here heart is the dataframe):

for i in heart.columns:
    x = heart[i].value_counts()
    print("Column name is:", i, "and its value counts are:", x)

A solution wrapped in one line looks simpler than using groupby, stacking, etc.:

pd.concat([df[column].value_counts() for column in df], axis = 1)
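One version-specific pitfall with this one-liner: since pandas 2.0, Series.value_counts names its result 'count', so the concatenated columns all come out labelled 'count' unless you rename each piece back to its column name. A hedged sketch of a version-robust variant, checked against the apply approach:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 4)), columns=list('abcd'))

# rename each piece back to its column name (pandas >= 2.0 names them all 'count')
via_concat = pd.concat(
    [df[column].value_counts().rename(column) for column in df], axis=1
)
via_apply = df.apply(pd.Series.value_counts)
# the two approaches should agree cell for cell
print((via_concat.sort_index() == via_apply.sort_index()).all().all())
```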

This is what worked for me:

for column in df.columns:
    print("\n" + column)
    print(df[column].value_counts())


You can use a lambda function:

df.apply(lambda x: x.value_counts())

Ran into this question while checking whether there was a better way of doing what I was doing. It turns out that calling df.apply(pd.value_counts) on a DataFrame whose columns each have many distinct values of their own results in a pretty substantial performance hit.

In this case, it is better to simply iterate over the non-numeric columns in a dictionary comprehension, and leave the result as a dictionary:

types_to_count = ["object", "category", "string"]
result = {
    col: df[col].value_counts()
    # select_dtypes avoids comparing dtype objects to strings by hand
    for col in df.select_dtypes(include=types_to_count).columns
}

Filtering by types_to_count helps ensure you don't try to take the value_counts of continuous data.
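A quick sketch of that filtering in action, using select_dtypes to do the type selection and made-up data; the numeric column is skipped and only the object column is counted:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'x'], 'b': [1.0, 2.0, 3.0]})

types_to_count = ["object", "category", "string"]
result = {
    col: df[col].value_counts()
    for col in df.select_dtypes(include=types_to_count).columns
}
# only the non-numeric column 'a' is counted; 'b' (float) is skipped
print(sorted(result))
```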

Another solution:

df = pd.DataFrame(np.random.randint(0, 2, (10, 4)), columns=list('abcd'))
pieces = []
for var in df.columns:
    pieces.append(df[var].value_counts())
# concat once at the end; starting from an empty Series adds a spurious NaN column
result = pd.concat(pieces, axis=1)
result

Sometimes some columns are subsequent in a hierarchy; in that case I recommend grouping them and then making counts:

# note: "_id" is whatever column you use to make the counts with len()
cat_cols = ['column_1', 'column_2']
df.groupby(cat_cols).agg(count=('_id', lambda x: len(x)))

                     count
column_1   column_2       
category_1 Excelent     19
           Good         11
           Bad           1
category_2 Happy        48
           Good mood   158
           Serious      62
           Sad          10
           Depressed     8

Bonus: you can change len(x) to x.nunique() or any other lambda function you want.
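For reference, groupby(...).size() gives the same counts without needing a dummy column to aggregate over. A sketch with made-up data (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'column_1': ['cat_1', 'cat_1', 'cat_2', 'cat_2', 'cat_2'],
                   'column_2': ['Good', 'Good', 'Happy', 'Sad', 'Happy']})

# size() counts rows per group directly, so no '_id' column is required
counts = df.groupby(['column_1', 'column_2']).size().rename('count')
print(counts)
```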

Applying the value_counts function gave unexpected / not the most readable results. But this approach seems super simple and easy to read:

df[["col1", "col2", "col3"]].value_counts()

Here is an example of the results if the columns have boolean values:

col1               col2         col3
False              False        False        1000
                   True         False        1000
True               False        False        1000
                                True         1000
                   True         False        1000
                                True         1000
dtype: int64
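If the long MultiIndex output above is hard to read, unstacking one level turns it back into a wide table. A sketch assuming two boolean columns (fill_value=0 covers combinations that never occur):

```python
import pandas as pd

df = pd.DataFrame({'col1': [False, False, True, True],
                   'col2': [False, True, False, True]})

# rows indexed by col1, columns by col2, cells holding the counts
wide = df.value_counts().unstack('col2', fill_value=0)
print(wide)
```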

You can list the column names (avoid naming the variable list, which shadows the Python built-in):

cols = ["a", "b", "c", "d"]

then run a for loop using the value_counts() function:

for i in cols:
    print(df[i].value_counts())
    print("\n")

You can also use the method given below:

for column in df.columns:
    print("\n" + column)
    print(df[column].value_counts())

I thought it would be nice if it could be implemented in a way that also works for columns with different sets of values.

This code will generate a dataframe with hierarchical columns, where the top column level signifies the column name from the original dataframe, and at the lower level you get two columns for each one: one for the values and one for the counts.

def val_cnts_df(df):
    val_cnts_dict = {}
    max_length = 0
    for col in df:
        val_cnts_dict[col] = df[col].value_counts()
        max_length = max(max_length, len(val_cnts_dict[col]))

    lists = [[col, prefix] for col in val_cnts_dict.keys() for prefix in ['values', 'counts']]
    columns = pd.MultiIndex.from_tuples(lists, names=['column', 'value_counts'])

    val_cnts_df = pd.DataFrame(data=np.zeros((max_length, len(columns))), columns=columns)

    for col in val_cnts_dict:
        # build plain Series (instead of reset_index, whose column names changed
        # in pandas 2.0) so shorter columns are padded with NaN on assignment
        vc = val_cnts_dict[col]
        val_cnts_df[col, 'values'] = pd.Series(vc.index)
        val_cnts_df[col, 'counts'] = pd.Series(vc.values)

    return val_cnts_df

Example of results:

autos = pd.DataFrame({'brand': ['Audi', 'Audi', 'VW', 'BMW', 'VW', 'VW'],
                      'gearbox': ['automatic', 'automatic', 'manual', 'automatic',
                                  'manual', 'manual'],
                     'doors': [5, 5, 5, 2, 5, 5]})

print(val_cnts_df(autos))
column        brand            gearbox        doors       
value_counts  values counts    values counts  values counts
0                 VW     3  automatic    3.0    5.0    5.0
1               Audi     2     manual    3.0    2.0    1.0
2                BMW     1        NaN    NaN    NaN    NaN
