Total number of unique values per person in a large file

Question

I have this set of unique values:

unique_list = {'apple', 'banana', 'coconut'}

I want to find how many of these elements occur in my large text file. I just need the count, not the names. For example, if only 'apple' and 'banana' are found for a particular person, the result for that person should be 2.

For each person (first name and family name), I need to know how many of these unique fruits that person has. In a large file, this might be difficult. I need the fastest way to do it.

Let's say I get the names from the text file:

people = {'cody melton', 'larisa harris', 'harry barry'}

The text file looks like this:

Name           Fruit  unit
cody melton    apple  3
cody melton    banana 5
cody melton    banana 7
larisa harris  apple  8
larisa harris  apple  5

The output should look like this:

{'cody melton': 2, 'larisa harris': 1, 'harry barry': 0}

I do not want to use any third-party packages, just built-ins and the standard library.

Answer 1

You can use collections from Python's standard library:

from collections import Counter
import pandas as pd  # needed for pd.Series below

dict(Counter(pd.Series(['cody', 'cody ', 'cody ', 'melton', 'melton', 'harry'])))

Output:

{'cody ': 2, 'melton': 2, 'cody': 1, 'harry': 1}

In the example above, I passed a pd.Series as the argument. In your case, you can pass df['name'], which is also a pd.Series object.
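Since the question rules out extra packages, it may help to note that Counter accepts any iterable, so the same count works on a plain list. A minimal sketch reusing the sample values above (the variable names are only illustrative):

```python
from collections import Counter

# Same sample values as above. Note that 'cody' and 'cody ' (trailing
# space) are different strings, so they are counted separately.
names = ['cody', 'cody ', 'cody ', 'melton', 'melton', 'harry']

# Counter accepts any iterable of hashable items, so pandas is not required.
name_counts = dict(Counter(names))

# Stripping whitespace first merges 'cody' and 'cody ' into a single key.
clean_counts = dict(Counter(name.strip() for name in names))
```

Whether stripping is appropriate depends on how the names appear in the real file.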

Answer 2

You don't specify the format of your source data, so let's assume it's a list of lists:

>>> data = [["cody melton", "apple", 3], ["cody melton", "banana", 5],
            ["cody melton", "banana", 7], ["larisa harris", "apple", 8],
            ["larisa harris", "apple", 5]]

When you are looking for performance in "vanilla" Python, look at the standard library, in this case collections.Counter. We'll use it to count all unique (name, fruit) combinations:

>>> from collections import Counter
>>> pairs = Counter((x[0], x[1]) for x in data)
>>> pairs
Counter({('cody melton', 'banana'): 2, ('larisa harris', 'apple'): 2, ('cody melton', 'apple'): 1})

The argument is a generator expression that builds a (name, fruit) tuple from each row of the source data, and Counter counts how often each tuple occurs.

EDIT: If you want to count only the rows whose fruit is in a specific set:

fruits = {'apple', 'banana', 'coconut'}

then add it as a condition in the generator expression:

>>> pairs = Counter(((x[0], x[1]) for x in data if x[1] in fruits))

We're almost there. What is left is to count how many distinct pairs each name appears in:

>>> names = Counter((pair[0] for pair in pairs))
>>> names
Counter({'cody melton': 2, 'larisa harris': 1})
>>> dict(names)  # this is how to cast it to a regular dict
{'larisa harris': 1, 'cody melton': 2}

I see your expected output includes a "harry barry" with 0 occurrences. That name does not appear in the source data, so just add it to the dict with value 0.
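Putting the steps of this answer together, including the zero fill, might look like the sketch below. The function name count_unique_fruits and the people set (spelled to match the sample data) are assumptions for illustration. It also swaps the first Counter for a plain set of pairs, since only the distinct (name, fruit) combinations matter for the final count.

```python
from collections import Counter

def count_unique_fruits(data, fruits, people):
    """Return {person: number of distinct fruits from `fruits` seen for them}."""
    # A set of (name, fruit) pairs drops repeated rows for the same person and fruit.
    pairs = {(row[0], row[1]) for row in data if row[1] in fruits}
    # Count how many distinct pairs each name appears in.
    names = Counter(name for name, _ in pairs)
    # A Counter returns 0 for missing keys, so people absent from the data get 0.
    return {person: names[person] for person in people}

data = [["cody melton", "apple", 3], ["cody melton", "banana", 5],
        ["cody melton", "banana", 7], ["larisa harris", "apple", 8],
        ["larisa harris", "apple", 5]]
fruits = {'apple', 'banana', 'coconut'}
people = {'cody melton', 'larisa harris', 'harry barry'}  # assumed spelling

result = count_unique_fruits(data, fruits, people)
print(result)
```

For the sample data this gives 2 for cody melton, 1 for larisa harris, and 0 for harry barry.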

Answer 3

Just do it:

xx = ['apple', 'apple', 'banana', 'coconut']
d = {}

# Count how many times each item occurs.
for x in xx:
    if x in d:
        d[x] += 1
    else:
        d[x] = 1

print(d)
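The same dictionary-update loop could be adapted to the original question by keeping a set of fruits per person instead of a running count, reading the file line by line. This is only a sketch: the function name unique_fruits_per_person is hypothetical, and it assumes every data line ends with a fruit and a unit column separated by whitespace, as in the sample file.

```python
def unique_fruits_per_person(lines, fruits, people):
    """Count the distinct fruits from `fruits` seen for each person in `people`."""
    # One set per person; starting from `people` means absent names end up with 0.
    seen = {person: set() for person in people}
    for line in lines:
        # Split from the right: the last two fields are fruit and unit,
        # everything before them is the (possibly multi-word) name.
        parts = line.strip().rsplit(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, fruit, _unit = parts
        name = ' '.join(name.split())  # collapse repeated spaces inside the name
        # The header row parses as the name 'Name', which is not in `seen`,
        # so it is ignored here like any other unknown name.
        if name in seen and fruit in fruits:
            seen[name].add(fruit)
    return {person: len(found) for person, found in seen.items()}

sample = """\
Name           Fruit  unit
cody melton    apple  3
cody melton    banana 5
cody melton    banana 7
larisa harris  apple  8
larisa harris  apple  5
"""
fruits = {'apple', 'banana', 'coconut'}
people = {'cody melton', 'larisa harris', 'harry barry'}  # assumed spelling

counts = unique_fruits_per_person(sample.splitlines(), fruits, people)
print(counts)
```

With a real file, passing the open file object as `lines` streams it one line at a time, so the whole file never has to sit in memory.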
