Counting unique values in a column

Question

I have a df in which one column holds multiple comma-separated values in each row. I want to count how many times each unique value occurs in that column.

The df looks like this:

                             category  country
0  widget1, widget2, widget3, widget4      USA
1                    widget1, widget3      USA
2                   widget1, widget2     China
3                             widget2   Canada
4           widget1, widget2, widget3    China
5                             widget2  Vietnam
6                             widget3   Canada
7                    widget1, widget3      USA
8                    widget1, widget3    Japan
9                             widget2  Germany

I want to know how many times each widget appears in the column "category". The results in this example would be:

widget1 = 6, widget2 = 6, widget3 = 6, widget4 = 1

I can use .value_counts

df["category"].value_counts()

but that only counts rows whose entire string is exactly the same.

I could use value_counts and enter each value for it to count, but the actual DataFrame has too many rows and unique values in that column for this to be practical.

Also, is there a way to avoid double counting when a single row contains the same value more than once? For example, if there was a "widget1, black widget1, yellow widget1" in the same row, I'd just want to count that as one widget1.
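To make the limitation concrete, here is a small sketch that rebuilds the sample df from the table above and calls value_counts on the raw column. The variable name whole_cell_counts is only illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "category": [
        "widget1, widget2, widget3, widget4",
        "widget1, widget3",
        "widget1, widget2",
        "widget2",
        "widget1, widget2, widget3",
        "widget2",
        "widget3",
        "widget1, widget3",
        "widget1, widget3",
        "widget2",
    ],
    "country": ["USA", "USA", "China", "Canada", "China",
                "Vietnam", "Canada", "USA", "Japan", "Germany"],
})

# value_counts treats each cell as one opaque string,
# so "widget1, widget3" is counted as a single value
whole_cell_counts = df["category"].value_counts()
print(whole_cell_counts)
```

The result has one entry per distinct cell string (six here), not one entry per widget, which is why the values need to be split first.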

Answer 1

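With get_dummies:

# The separator must match the data exactly (comma plus space here);
# sum(level=1) was removed in pandas 2.0, so sum the indicator columns directly
df['category'].str.get_dummies(sep=', ').sum()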
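As a rough illustration of why get_dummies fits this problem, the sketch below uses a small made-up Series (not the asker's data) in which one row repeats widget1. It assumes the values are separated by a comma followed by a space.

```python
import pandas as pd

# Small hypothetical sample; the second row repeats widget1 on purpose
s = pd.Series(["widget1, widget2", "widget1, widget1, widget3", "widget2"])

# One 0/1 indicator column per distinct value; sep must match the data
dummies = s.str.get_dummies(sep=", ")

# Column sums give the number of rows that contain each value
row_counts = dummies.sum()
print(row_counts)
```

Because each indicator is 0 or 1, a value repeated inside one row is still counted once for that row. Values that differ as strings, such as "black widget1", would get their own column.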

Answer 2

Another solution would be to unnest your strings into rows, then use value_counts on the resulting column:

explode_str(df, 'category', ', ')['category'].value_counts()

widget2    6
widget1    6
widget3    6
widget4    1
Name: category, dtype: int64

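Function used from linked answer:

import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # Repeat each row position once per value contained in that row's string
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # Join all strings, split them back into single values, assign to the repeated rows
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})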
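The same unnest-then-count idea can also be sketched with built-in methods instead of a helper, assuming pandas 0.25 or later (where Series.explode is available) and a comma-plus-space separator. This is an alternative to explode_str, not the linked answer's method.

```python
import pandas as pd

# The "category" column from the question
category = pd.Series([
    "widget1, widget2, widget3, widget4",
    "widget1, widget3",
    "widget1, widget2",
    "widget2",
    "widget1, widget2, widget3",
    "widget2",
    "widget3",
    "widget1, widget3",
    "widget1, widget3",
    "widget2",
], name="category")

# Split each cell into a list, then give every list element its own row
exploded = category.str.split(", ").explode()

counts = exploded.value_counts()
print(counts)
```

explode keeps the original row label on every new row, so the exploded Series can still be traced back to the rows it came from.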

Answer 3

This might not be the most elegant solution but I think it should work. Basically we need to separate each word in the category column and then count the words.

from itertools import chain
import pandas as pd

# Split each row on commas, then flatten and strip surrounding whitespace
words = [i.split(',') for i in df['category'].tolist()]
words = [i.strip() for i in chain.from_iterable(words)]
pd.Series(words).value_counts()
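The split-and-count approach above counts every occurrence, so a value repeated within one row is counted more than once. A minimal sketch of a per-row de-duplicated variant follows, using a made-up Series and collections.Counter. It only merges exactly identical strings: variants such as "black widget1" would first need a normalization rule, which the question does not specify.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample; the first row repeats widget1
s = pd.Series(["widget1, widget1, widget2", "widget1, widget3", "widget2"])

row_counts = Counter()
for cell in s:
    # A set keeps each value at most once per row
    row_counts.update({value.strip() for value in cell.split(",")})

result = pd.Series(row_counts).sort_values(ascending=False)
print(result)
```

Here widget1 ends up with a count of 2 (two rows contain it) even though it appears three times in total.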

3 answers

Answer 1: score 4, accepted, 2019-05-22 15:56:06
Answer 2: score 1, 2019-05-22 16:18:49
Answer 3: score 0, 2019-05-22 15:59:09
