
Counting Unique Values in a Column

I have a DataFrame in which one column holds multiple comma-separated values in each row. I want to count how many times each unique value occurs in that column.

The df looks like this:

                             category  country
0  widget1, widget2, widget3, widget4      USA
1                    widget1, widget3      USA
2                   widget1, widget2     China
3                             widget2   Canada
4           widget1, widget2, widget3    China
5                             widget2  Vietnam
6                             widget3   Canada
7                    widget1, widget3      USA
8                    widget1, widget3    Japan
9                             widget2  Germany 


I want to know how many times each widget appears in the column "category". The results in this example would be:

widget1 = 6, widget2 = 6, widget3 = 6, widget4 = 1

I can use .value_counts:

df["category"].value_counts()

but that only counts rows whose entire string is exactly the same, not the individual widgets inside each row.


I could use value_counts and enter each value for it to count, but the actual DataFrame has too many rows and unique values in that column for that to be practical.

Also, is there a way to avoid double counting when a single row contains the same value more than once? For example, if there were a "widget1, black widget1, yellow widget1" in the same row, I'd just want to count that as one widget1.

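Use str.get_dummies, which builds one 0/1 indicator column per value, then sum each column. Split on ', ' (comma plus space) so that ' widget2' and 'widget2' are not treated as different values:

df['category'].str.get_dummies(sep=', ').sum()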
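To make the indicator-column idea easier to try, here is a minimal self-contained sketch that rebuilds the sample data from the question and sums the indicators. It assumes the values are always separated by a comma followed by exactly one space, as in the sample; the variable names `dummies` and `counts` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "category": [
        "widget1, widget2, widget3, widget4",
        "widget1, widget3",
        "widget1, widget2",
        "widget2",
        "widget1, widget2, widget3",
        "widget2",
        "widget3",
        "widget1, widget3",
        "widget1, widget3",
        "widget2",
    ],
    "country": ["USA", "USA", "China", "Canada", "China",
                "Vietnam", "Canada", "USA", "Japan", "Germany"],
})

# One 0/1 indicator column per distinct value. A value repeated within
# a single row still yields 1, so each row counts at most once per value.
dummies = df["category"].str.get_dummies(sep=", ")

# Summing each indicator column gives the number of rows containing that value
counts = dummies.sum()
print(counts)
```

Because the indicators are 0/1, an exact duplicate inside one row is not double counted, though variants such as "black widget1" would still get their own column.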

Another solution would be to unnest your string to rows, then use value_counts on the resulting column:

explode_str(df, 'category', ', ')['category'].value_counts()

widget2    6
widget1    6
widget3    6
widget4    1
Name: category, dtype: int64

The explode_str function, taken from a linked answer:

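import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # Repeat each row position once per value in that row (separator count + 1)
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # Join all cells, split them again, and assign one value per repeated row
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})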
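As a possible alternative to the helper, the same unnesting can be sketched with built-in pandas methods. This assumes pandas 0.25 or later (where `Series.explode` is available) and a ", " separator as in the sample data; it is not taken from the original answer.

```python
import pandas as pd

# Two rows from the question are enough to show the idea
df = pd.DataFrame({
    "category": ["widget1, widget2, widget3, widget4", "widget1, widget3"],
    "country": ["USA", "USA"],
})

# Split each cell into a list, then give every list element its own row
exploded = df["category"].str.split(", ").explode()

# Count how often each individual value occurs
counts = exploded.value_counts()
print(counts)
```

The exploded Series keeps the original row index, so it can be joined back to the other columns if per-country counts are needed.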

This might not be the most elegant solution, but I think it should work. Basically, we need to separate each word in the category column and then count the words.

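from itertools import chain
import pandas as pd

# Split each cell on commas, then flatten the lists and strip surrounding whitespace
words = [i.split(',') for i in df['category'].tolist()]
words = [w.strip() for w in chain.from_iterable(words)]
pd.Series(words).value_counts()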
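None of the approaches above collapses variants such as "black widget1" and "yellow widget1" into a single widget1 per row, which the question also asks about. One way this might be handled is sketched below. It rests on a loud assumption that is not stated in the question: the widget name is always the last word of each comma-separated item. The helper `unique_widgets` and the two sample rows are hypothetical.

```python
from collections import Counter

import pandas as pd

# Hypothetical rows illustrating the case described in the question
df = pd.DataFrame({
    "category": ["widget1, black widget1, yellow widget1", "widget1, widget2"],
})

def unique_widgets(cell):
    # Assumption: the widget name is the last word of each comma-separated item
    return {item.split()[-1] for item in cell.split(",") if item.strip()}

counts = Counter()
for cell in df["category"]:
    # A set is passed in, so each row adds at most 1 per widget
    counts.update(unique_widgets(cell))

result = pd.Series(dict(counts)).sort_values(ascending=False)
print(result)
```

If the real data names widgets differently (for example, with the colour after the name), the rule inside `unique_widgets` would need to change accordingly.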
