How to find duplicates from a full data frame using pandas?

I have a data frame with 3 columns, one per class, and 5 rows of students in each class. Some students appear in more than one class. I want to list every student name across all the classes, ordered from most to least common, together with the number of classes each student appears in and the names of those classes.

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

   biology statistics ecology
0     ryan      sarah  austin
1    sarah         ed    ryan
2      tom      jacob     tom
3       ed       ryan     sam
4  jackson         de   sarah

I want the output to look something like this:

ryan, 3 classes, (biology, statistics, ecology)
sarah, 3 classes, (biology, statistics, ecology)
tom, 2 classes, (biology, ecology)
ed, 2 classes, (biology, statistics)
jackson, 1 class, (biology)
jacob, 1 class, (statistics)
de, 1 class, (statistics)
austin, 1 class, (ecology)

...and so on.

Any help would be appreciated. I'm a beginner and I have been at this for several hours. Brain is getting killed. Thanks!

We can melt the DataFrame to get to long form, then groupby and aggregate with Named Aggregation to get both the number of classes and the names of the classes. Lastly, we can sort_values to get the highest-frequency students first:

output_df = (
    df.melt(var_name='class name', value_name='student name')
        .groupby('student name', as_index=False)
        .agg(class_count=('class name', 'count'),
             classes=('class name', tuple))
        .sort_values('class_count', ascending=False, ignore_index=True)
)

output_df:

  student name  class_count                         classes
0         ryan            3  (biology, statistics, ecology)
1        sarah            3  (biology, statistics, ecology)
2           ed            2           (biology, statistics)
3          tom            2              (biology, ecology)
4       austin            1                      (ecology,)
5           de            1                   (statistics,)
6      jackson            1                      (biology,)
7        jacob            1                   (statistics,)
8          sam            1                      (ecology,)
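To make the melt step less abstract, here is a minimal sketch of the intermediate long-form frame that the groupby operates on. The variable names `long_df` and `ryan_classes` are only illustrative and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

# Long form: one row per (class, student) pair.
# 3 columns x 5 rows in the wide frame become 15 rows here.
long_df = df.melt(var_name='class name', value_name='student name')
print(long_df.head())

# The rows for a single student are what groupby later counts and
# collects into a tuple (illustrative check for one name).
ryan_classes = long_df.loc[
    long_df['student name'].eq('ryan'), 'class name'
].tolist()
print(ryan_classes)
```

Each student now contributes one row per class, so counting rows per student gives the number of classes, and collecting the 'class name' values gives the class list.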

We can further conditionally append " class" or " classes" to class_count and write the result with to_csv:

# Conditionally Add Classes/Class
output_df['class_count'] = output_df['class_count'].astype(str) + np.where(
    output_df['class_count'].eq(1),
    ' class',
    ' classes'
)
# Write to CSV
output_df.to_csv('output.csv', index=False, header=False)

output.csv:

ryan,3 classes,"('biology', 'statistics', 'ecology')"
sarah,3 classes,"('biology', 'statistics', 'ecology')"
ed,2 classes,"('biology', 'statistics')"
tom,2 classes,"('biology', 'ecology')"
austin,1 class,"('ecology',)"
de,1 class,"('statistics',)"
jackson,1 class,"('biology',)"
jacob,1 class,"('statistics',)"
sam,1 class,"('ecology',)"
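The CSV above quotes the tuples, which differs slightly from the layout requested in the question. If plain text lines such as "ryan, 3 classes, (biology, statistics, ecology)" are wanted instead, one possible sketch is to format each row by hand. The helper name `format_line` and the variable `summary` are assumptions made for this illustration, and the order of students with equal counts is not guaranteed here.

```python
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

# Same melt + groupby + named aggregation as in the answer above
summary = (
    df.melt(var_name='class name', value_name='student name')
      .groupby('student name', as_index=False)
      .agg(class_count=('class name', 'count'),
           classes=('class name', tuple))
      .sort_values('class_count', ascending=False, ignore_index=True)
)


def format_line(name, count, classes):
    """Build one 'name, N class(es), (a, b, c)' line (hypothetical helper)."""
    noun = 'class' if count == 1 else 'classes'
    return f"{name}, {count} {noun}, ({', '.join(classes)})"


# zip over the three columns so each value keeps its native type
lines = [
    format_line(name, count, classes)
    for name, count, classes in zip(
        summary['student name'], summary['class_count'], summary['classes']
    )
]
print('\n'.join(lines))
```

Joining the tuple with ', ' avoids both the quotes and the trailing comma that Python prints for one-element tuples.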

Setup and imports:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

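# Alternative: build the summary with a plain loop over the columns
results = {}
for h in df:
    # value_counts() gives each student's number of appearances in class h
    for k, v in df[h].value_counts().items():
        if k in results:
            results[k]['value'] += v
            results[k]['class'].append(h)
        else:
            results[k] = {'value': v, 'class': [h]}
# Rebuild the dict ordered by total count, highest first
results = {h: results[h] for h in sorted(results, key=lambda x: results[x]['value'], reverse=True)}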
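The loop above leaves the summary in a dictionary rather than printing it. As a rough sketch of how it could be turned into the lines requested in the question, the snippet below repeats the loop (so that it runs on its own) and then formats each entry. The helper `describe` and the names `ranked` and `report` are hypothetical, introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

# Accumulate, per student, the total count and the classes seen
results = {}
for class_name in df:
    for student, count in df[class_name].value_counts().items():
        entry = results.setdefault(student, {'value': 0, 'class': []})
        entry['value'] += int(count)
        entry['class'].append(class_name)

# Highest total first
ranked = sorted(results.items(), key=lambda item: item[1]['value'], reverse=True)


def describe(student, info):
    """Format one dictionary entry as a report line (hypothetical helper)."""
    noun = 'class' if info['value'] == 1 else 'classes'
    return f"{student}, {info['value']} {noun}, ({', '.join(info['class'])})"


report = [describe(student, info) for student, info in ranked]
print('\n'.join(report))
```

Because the outer loop walks the columns in order, each student's class list follows the column order of the data frame.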
