Get unique values from a column using Python

Question

I'm trying to get the unique values from the column 'name' for every distinct value in the column 'gender'.

Here is the sample data (input_file_data):

index,name,gender,alive
1,Adam,Male,Y
2,Bella,Female,N
3,Marc,Male,Y
1,Adam,Male,N

I can get the result when I supply a specific 'gender' value. For example, the code below uses "Male":

filtered_data = filter(lambda person: person["gender"] == "Male", input_file_data)
reader = (dict((k, v.strip()) for k, v in row.items() if v) for row in filtered_data)
countt = [rec[gender] for rec in reader]
final1 = input_file_name + ".txt", "gender", "Male"
output1 = str(final1).replace("(", "").replace(")", "").replace("'","").replace(", [{", " -- [").replace("}", "")
final2 = set(re.findall(r"name': '(.*?)'", str(filtered_data)))
final_count = len(final2)
output = str(final_count) + " occurrences", str(final2)
output2 = output1, str(output)
output_final = str(output2).replace('\\', "").replace('"',"").replace(']"', "]").replace("set", "").replace("(", "").replace(")", "").replace("'","").replace(", [{", " -- [").replace("}", "")
output_final = output_final + "\n"

Current output:

input_file_name.txt, gender, Male, 2 occurrences, [Adam,Marc]

Expected output:

input_file_name.txt, gender, Male, 2 occurrences, [Adam,Marc], Female, 1 occurrences [Bella]

The output should show all the unique names for every distinct gender value, without hardcoding the values. I also do not want to use Pandas. Any help is highly appreciated.

PS: I have multiple files, and not all of them have the same columns, so I can't hardcode the columns. All the files have a 'name' column, but not all of them have a 'gender' column. The script should also work for any other column, such as 'index' or 'alive', and not just gender.

Answer 1

I would use the csv module along with defaultdict from collections for this. Say the data is stored in a file called test.csv:

>>> import csv
>>> from collections import defaultdict
>>> with open('test.csv', newline='') as fin:
...     data = list(csv.reader(fin))[1:]  # skip the header row
...
>>> gender_dict = defaultdict(set)
>>> for idx, name, gender, alive in data:
...     gender_dict[gender].add(name)
...
>>> gender_dict
defaultdict(<class 'set'>, {'Male': {'Adam', 'Marc'}, 'Female': {'Bella'}})

You now have a dictionary. Each key is a unique value from the gender column, and each value is a set, so it holds only unique items. Notice that we added 'Adam' twice, but he appears only once in the resulting set.

You don't need defaultdict, but it saves you the boilerplate of checking whether a key already exists.
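For comparison, here is a minimal sketch of the same grouping with a plain dict and `setdefault`. The `rows` list is a hypothetical stand-in for the parsed CSV rows and simply mirrors the sample data from the question.

```python
# Hypothetical rows mirroring the sample data (index, name, gender, alive).
rows = [
    ("1", "Adam", "Male", "Y"),
    ("2", "Bella", "Female", "N"),
    ("3", "Marc", "Male", "Y"),
    ("1", "Adam", "Male", "N"),
]

gender_dict = {}
for idx, name, gender, alive in rows:
    # setdefault inserts an empty set the first time a gender is seen,
    # then returns the stored set so the name can be added to it.
    gender_dict.setdefault(gender, set()).add(name)

print(gender_dict)
```

This behaves the same as the defaultdict version; the only difference is that the "create the set if missing" step is spelled out on every iteration.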

EDIT: It might help to have better visibility into the data itself. Given your code, I can make the following assumptions:

input_file_data is an iterable (list, tuple, something like that) containing dictionaries.
Each dictionary contains a 'gender' key. If it did not, you would get a KeyError when trying to filter it.
Each dictionary appears to have a 'name' key.

Rather than doing all of that regex work, what about this?

>>> gender_dict = {'Male': set(), 'Female': set()}
>>> for item in input_file_data:
        gender_dict[item['gender']].add(item['name'])

You can use item.get('name') instead of item['name'] if not every entry will have a name.
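The same `.get` idea can also cover files that lack the grouping column altogether, and it avoids listing the gender values up front. The sketch below is only an illustration: `unique_names_by` is a hypothetical helper name, and the last row (with no 'gender' key) is made up to show the skip behaviour.

```python
from collections import defaultdict


def unique_names_by(rows, column):
    """Group unique 'name' values by the values found in `column`.

    Rows that lack either `column` or 'name' are skipped.
    """
    groups = defaultdict(set)
    for item in rows:
        value = item.get(column)
        name = item.get("name")
        if value is None or name is None:
            continue  # this row cannot be grouped
        groups[value].add(name)
    return dict(groups)


# Hypothetical data shaped like input_file_data in the question.
input_file_data = [
    {"index": "1", "name": "Adam", "gender": "Male", "alive": "Y"},
    {"index": "2", "name": "Bella", "gender": "Female", "alive": "N"},
    {"index": "3", "name": "Marc", "gender": "Male", "alive": "Y"},
    {"index": "1", "name": "Adam", "gender": "Male", "alive": "N"},
    {"index": "4", "name": "Dana"},  # made-up row with no 'gender' key
]

by_gender = unique_names_by(input_file_data, "gender")
by_missing = unique_names_by(input_file_data, "no_such_column")
```

If a file has no such column at all, the helper simply returns an empty dict instead of raising a KeyError.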

Edit #2: OK, the first thing you need to do is get your data into a consistent state. We can get to a point where, for each column name (gender, index, alive, whatever you want), you have every distinct value in that column mapped to the set of unique names that go with it. Something like this:

data_dict = {'gender':
                 {'Male': ['Adam', 'Marc'],
                  'Female': ['Bella']},
             'alive':
                 {'Y': ['Adam', 'Marc'],
                  'N': ['Bella', 'Adam']},
             'index':
                 {1: ['Adam'],
                  2: ['Bella'],
                  3: ['Marc']}
             }

If that's what you want, you could try this:

>>> data_dict = defaultdict(lambda: defaultdict(set))
>>> for element in input_file_data:
...     for key, value in element.items():
...         if key != 'name':
...             data_dict[key][value].add(element['name'])

That should get you what you want, I think. I can't test it because I don't have your data, but give it a try.
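Putting the pieces together, here is a rough end-to-end sketch that reads the sample data with `csv.DictReader`, builds the nested dictionary, and formats one output line per column. The helper names `group_names` and `format_column`, the inline `SAMPLE` string, and the choice to sort values alphabetically are all assumptions made for illustration; the output format only approximates the one in the question.

```python
import csv
import io
from collections import defaultdict

# Inline copy of the sample data so the sketch runs without a file.
SAMPLE = """index,name,gender,alive
1,Adam,Male,Y
2,Bella,Female,N
3,Marc,Male,Y
1,Adam,Male,N
"""


def group_names(rows):
    """Map column -> column value -> set of unique names."""
    data_dict = defaultdict(lambda: defaultdict(set))
    for row in rows:
        name = row.get("name")
        if not name:
            continue  # nothing to record without a name
        for key, value in row.items():
            # DictReader files surplus fields under the key None; skip those,
            # the 'name' column itself, and empty or missing values.
            if key in (None, "name") or not value:
                continue
            data_dict[key][value.strip()].add(name.strip())
    return data_dict


def format_column(file_name, column, groups):
    """Build one summary line for a single column."""
    parts = [file_name, column]
    for value in sorted(groups):
        names = sorted(groups[value])
        parts.append("{}, {} occurrences, [{}]".format(
            value, len(names), ",".join(names)))
    return ", ".join(parts)


rows = csv.DictReader(io.StringIO(SAMPLE))
data_dict = group_names(rows)
line = format_column("input_file_name.txt", "gender", data_dict["gender"])
print(line)
```

To process a real file, replace `io.StringIO(SAMPLE)` with a file object opened with `newline=''`. Looping over `data_dict` instead of passing "gender" produces a line for every column the file happens to have.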

Answer 1
3 2015-03-25 17:03:26