Count unique symbols per column in Pandas
I was wondering how to calculate the number of unique symbols that occur in a single column of a DataFrame. For example:
df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''], 'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})
df
  col1     col2
0    a      ddd
1  bbb    eeeee
2   cc       ff
3       ggggggg
It should calculate that col1 contains 3 unique symbols, and col2 contains 4 unique symbols.
My code so far (but this might be wrong):
unique_symbols = [0]*203
i = 0
for col in df.columns:
    observed_symbols = []
    df_temp = df[[col]]
    df_temp = df_temp.astype('str')
    # This part is where I am not so sure
    for index, row in df_temp.iterrows():
        # Walk through every character of the cell's string
        for symbol in row[col]:
            if symbol not in observed_symbols:
                observed_symbols.append(symbol)
    unique_symbols[i] = len(observed_symbols)
    i += 1
Thanks in advance
Here is one way:
df.apply(lambda x: len(set(''.join(x.astype(str)))))
col1 3
col2 4
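To spell out what that one-liner does, here is a minimal self-contained sketch. The helper name `count_unique_symbols` is my own and not part of pandas; it joins all cells of a column into one string and counts the distinct characters with a set.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''],
                   'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})

def count_unique_symbols(column):
    # Hypothetical helper (not a pandas function).
    # Cast every cell to str, glue the cells into one long string,
    # then let a set keep only the distinct characters.
    joined = ''.join(column.astype(str))
    return len(set(joined))

# apply() calls the helper once per column
result = df.apply(count_unique_symbols)
print(result)
```

Keep in mind that `astype(str)` turns missing values into the string 'nan', so its letters would be counted as symbols if the column has gaps.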
Option 1
str.join + set inside a dict comprehension
For problems like this, I'd prefer falling back to plain Python, because it's so much faster.
{c : len(set(''.join(df[c]))) for c in df.columns}
{'col1': 3, 'col2': 4}
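One caveat with the plain `''.join(df[c])` form: it raises a TypeError if a column holds anything that is not a string, such as a missing value. A possible variation, assuming you simply want to ignore missing cells, drops them first and casts the rest to str. The frame `df_gaps` below is made up for illustration.

```python
import pandas as pd

# Made-up frame with a missing cell in each column
df_gaps = pd.DataFrame({'col1': ['a', None, 'cc'],
                        'col2': ['ddd', 'ff', None]})

# dropna() removes the missing cells so that join only sees strings;
# astype(str) guards against other non-string values such as numbers.
counts = {c: len(set(''.join(df_gaps[c].dropna().astype(str))))
          for c in df_gaps.columns}
print(counts)
```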
Option 2
agg
If you want to stay in pandas space:
df.agg(lambda x: set(''.join(x)), axis=0).str.len()
Or,
df.agg(lambda x: len(set(''.join(x))), axis=0)
col1 3
col2 4
dtype: int64
Maybe
df.sum().apply(set).str.len()
Out[673]:
col1 3
col2 4
dtype: int64
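One more option (this sums the unique-symbol count of each cell, so it only matches the per-column count when no symbol appears in more than one row, as in this example; newer pandas versions also deprecate applymap in favor of DataFrame.map):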
In [38]: df.applymap(lambda x: len(set(x))).sum()
Out[38]:
col1 3
col2 4
dtype: int64