Counting Unique Values in a Column
I have a df that has one column with multiple comma-separated values in each row. I want to count how many times each unique value occurs in that column.
The df looks like this:
                             category  country
0  widget1, widget2, widget3, widget4      USA
1                    widget1, widget3      USA
2                    widget1, widget2    China
3                             widget2   Canada
4           widget1, widget2, widget3    China
5                             widget2  Vietnam
6                             widget3   Canada
7                    widget1, widget3      USA
8                    widget1, widget3    Japan
9                             widget2  Germany
I want to know how many times each widget appears in the column "category". The results in this example would be:
widget1 = 6, widget2 = 6, widget3 = 6, widget4 = 1
I can use .value_counts
df["category"].value_counts()
but that only counts rows whose whole string is exactly the same, not the individual values inside each row.
I could use value_counts and enter each value for it to count, but in the actual DataFrame there are too many rows and unique values in that column to make that practical.
Also, is there a way to not double count if a single row contains two values that are the same? For example, if there was a "widget1, black widget1, yellow widget1" in the same row, I'd just want to count that as one widget1.
With get_dummies:
# One 0/1 indicator column per value, so the column sums count the rows that contain each value.
df['category'].str.get_dummies(sep=', ').sum()
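To make the get_dummies idea easier to try out, here is a minimal self-contained sketch that rebuilds the sample frame from the question. It assumes the values are always separated by a comma followed by one space; if the spacing in the real data varies, the strings would need normalizing first.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "category": [
        "widget1, widget2, widget3, widget4",
        "widget1, widget3",
        "widget1, widget2",
        "widget2",
        "widget1, widget2, widget3",
        "widget2",
        "widget3",
        "widget1, widget3",
        "widget1, widget3",
        "widget2",
    ],
    "country": ["USA", "USA", "China", "Canada", "China",
                "Vietnam", "Canada", "USA", "Japan", "Germany"],
})

# ASSUMPTION: the separator is exactly ", " (comma plus one space).
# Each distinct value becomes a 0/1 indicator column, so a value that is
# repeated verbatim inside one row still contributes only 1 for that row.
dummies = df["category"].str.get_dummies(sep=", ")

# Summing each indicator column gives the number of rows containing that value.
counts = dummies.sum()
print(counts)
```

Because the indicators are 0/1 per row, this approach already avoids double counting exact duplicates within a row, though it does not merge variants such as "black widget1" into "widget1".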
Another solution would be to unnest your strings to rows, then use value_counts:
explode_str(df, 'category', ', ')['category'].value_counts()
widget2 6
widget1 6
widget3 6
widget4 1
Name: category, dtype: int64
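Function used from linked answer:
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # Repeat each row position once per value found in that row.
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # Join all strings, re-split them, and assign one value per repeated row.
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})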
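As an alternative to the helper above (not the linked answer's method), newer pandas versions (0.25 and later) ship a built-in `explode`, so the same unnesting can be sketched without a custom function. The small frame below is illustrative data, not the question's full table.

```python
import pandas as pd

# Illustrative data only; three rows in the same shape as the question's column.
df = pd.DataFrame({"category": ["widget1, widget2", "widget2", "widget1, widget3"]})

# Split each string into a list, give every list element its own row,
# then strip the whitespace left over after the commas.
exploded = df["category"].str.split(",").explode().str.strip()

# Count how often each individual value occurs across all rows.
counts = exploded.value_counts()
print(counts)
```

Splitting on a bare comma and stripping afterwards keeps the result tolerant of inconsistent spacing around the separators.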
This might not be the most elegant solution but I think it should work. Basically we need to split each row of the category column into its separate values and then count them.
from itertools import chain
import pandas as pd
# Split every row on commas, then flatten the lists and strip stray whitespace.
words = [i.split(',') for i in df['category'].tolist()]
words = [i.strip() for i in chain.from_iterable(words)]
pd.Series(words).value_counts()
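None of the snippets above merges variants such as "black widget1" and "yellow widget1" into a single widget1 per row, which the question also asks about. A rough sketch of one way to do that follows. It rests on an assumption the question does not confirm: that the widget name is always the last whitespace-separated word of each item. The `base_name` helper and the `rows` list are hypothetical, for illustration only.

```python
from collections import Counter

# Hypothetical rows; with a DataFrame these could come from df["category"].tolist().
rows = [
    "widget1, black widget1, yellow widget1",
    "widget1, widget2",
    "widget2",
]

def base_name(item):
    # ASSUMPTION: the widget name is the last whitespace-separated word,
    # so "black widget1" is reduced to "widget1".
    return item.split()[-1]

counter = Counter()
for row in rows:
    # A set keeps each widget at most once per row, so repeats are not double counted.
    names = {base_name(part) for part in row.split(",") if part.strip()}
    counter.update(names)

print(counter)
```

If the real names follow a different pattern, only `base_name` would need to change; the per-row set is what prevents the double counting.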