Efficient way to count unique elements in a column of lists?

Question

Each row of my dataframe holds a list of strings. I want to count the number of unique strings across the whole column. My current approach is slow:

              words
0  we like to party
1  can can dance
2  yes we can
...

df["words"].apply(lambda x: len(np.unique(x, return_counts=True)[1]))

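Wanted output: 7

It also doesn't check whether a word appears in two or more rows, and adding that check would make it even slower. Can this be done quickly? Thanks!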

Answer 1

I think you need the length of the set created by joining the rows and splitting the result into words:

a = len(set(' '.join(df['words']).split()))
print (a)
7
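
For comparison, the same count can be written with pandas string methods instead of a plain Python join. This is only a sketch of an alternative, not part of the answer above; it assumes the column holds whitespace-separated strings and a pandas version that provides `Series.explode` (0.25 or later). The variable name `n_unique` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "words": [
        "we like to party",
        "can can dance",
        "yes we can",
    ]
})

# Split every string on whitespace, flatten to one word per row,
# then count the distinct words across the whole column.
n_unique = df["words"].str.split().explode().nunique()
print(n_unique)
```

The join-and-split version above stays in plain Python and builds a single set, so it is usually a reasonable first choice; the pandas chain is mainly useful when the exploded words are needed for further processing.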

If the column holds lists instead, use a set comprehension (thanks @juanpa.arrivillaga):

print (df)
                   words
0  [we, like, to, party]
1      [can, can, dance]
2         [yes, we, can]


a = len({y for x in df['words'] for y in x})
print (a)
7
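
Two other spellings of the list case, shown as a hedged sketch rather than a recommendation: `set().union(*...)` merges all row lists into one set, and `explode().nunique()` does the same through pandas. The variable names are illustrative only. Note that `explode` turns an empty list into NaN, which `nunique` ignores by default.

```python
import pandas as pd

df = pd.DataFrame({
    "words": [
        ["we", "like", "to", "party"],
        ["can", "can", "dance"],
        ["yes", "we", "can"],
    ]
})

# Merge every row's list into a single set of distinct words.
unique_words = set().union(*df["words"])
n_unique_union = len(unique_words)

# pandas equivalent: one word per row, then count distinct values.
n_unique_explode = df["words"].explode().nunique()

print(n_unique_union, n_unique_explode)
```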

Answer 2

You can also use, for example, the following variant:

from itertools import chain
from operator import methodcaller

import pandas as pd

df = pd.DataFrame({
    "words": [
        "we like to party",
        "can can dance",
        "yes we can"
    ]
})

print(len(set(
    chain.from_iterable(
        map(methodcaller("split", " "), df.words.values)
    )
)))
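
One detail worth checking in the variant above is the explicit `" "` separator: `str.split(" ")` yields empty strings when the data contains repeated or trailing spaces, while `str.split()` with no argument collapses any run of whitespace. The sketch below illustrates the difference on made-up input; the helper name `count_unique_words` and the sample rows are assumptions for illustration, not part of the answer.

```python
from itertools import chain


def count_unique_words(rows, sep=None):
    """Count distinct words across an iterable of strings.

    sep=None splits on any run of whitespace; sep=" " splits on each
    single space and can therefore produce empty strings.
    """
    return len(set(chain.from_iterable(row.split(sep) for row in rows)))


# Hypothetical messy input: a double space and a trailing space.
rows = ["we  like to party", "can can dance ", "yes we can"]

count_whitespace = count_unique_words(rows)        # ignores extra spaces
count_single_space = count_unique_words(rows, " ")  # also counts ''

print(count_whitespace, count_single_space)
```

On clean, single-spaced data such as the example dataframe, both separators give the same result.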

2 answers

Answer 1
2 Accepted 2021-03-22 10:04:10

Answer 2
2 2021-03-22 10:13:02
