How to find duplicates from a full data frame using pandas?
I have a data frame with 3 columns, one per class, and 5 rows of students in each class. Some students appear in more than one class. I want to list the student names from all the classes in descending order of how often they appear, together with the number of classes each student is in and the names of those classes.
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
   biology statistics ecology
0     ryan      sarah  austin
1    sarah         ed    ryan
2      tom      jacob     tom
3       ed       ryan     sam
4  jackson         de   sarah
I want the output to look something like this:
ryan, 3 classes, (biology, statistics, ecology)
sarah, 3 classes, (biology, statistics, ecology)
tom, 2 classes, (biology, ecology)
ed, 2 classes, (biology, statistics)
jackson, 1 class, (biology)
jacob, 1 class, (statistics)
de, 1 class, (statistics)
austin, 1 class, (ecology)
...and so on
Any help would be appreciated. I'm a beginner and have been at this for several hours. My brain is getting killed. Thanks!
We can melt the DataFrame to get it into long form, then groupby and aggregate with Named Aggregation to get both the number of classes and the names of the classes. Lastly, we can sort_values to get the highest-frequency students first:
output_df = (
    df.melt(var_name='class name', value_name='student name')
      .groupby('student name', as_index=False)
      .agg(class_count=('class name', 'count'),
           classes=('class name', tuple))
      .sort_values('class_count', ascending=False, ignore_index=True)
)
output_df:
  student name  class_count                         classes
0         ryan            3  (biology, statistics, ecology)
1        sarah            3  (biology, statistics, ecology)
2           ed            2           (biology, statistics)
3          tom            2              (biology, ecology)
4       austin            1                      (ecology,)
5           de            1                   (statistics,)
6      jackson            1                      (biology,)
7        jacob            1                   (statistics,)
8          sam            1                      (ecology,)
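To make the first step of that chain easier to follow, here is a minimal sketch that runs melt on its own and prints the first rows of the long form. The variable name long_df is only an illustrative choice; the column names are the ones used in the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

# Each (class, student) pair becomes its own row:
# 3 columns x 5 rows turn into 15 rows with 2 columns.
long_df = df.melt(var_name='class name', value_name='student name')

print(long_df.shape)
print(long_df.head(6))
```

Once the data is in this shape, grouping by 'student name' collects every class a student appears in, which is what the groupby and agg steps rely on.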
We can further conditionally append "classes" or "class" to class_count and write the result with to_csv:
# Conditionally Add Classes/Class
output_df['class_count'] = output_df['class_count'].astype(str) + np.where(
output_df['class_count'].eq(1),
' class',
' classes'
)
# Write to CSV
output_df.to_csv('output.csv', index=False, header=None)
output.csv:
ryan,3 classes,"('biology', 'statistics', 'ecology')"
sarah,3 classes,"('biology', 'statistics', 'ecology')"
ed,2 classes,"('biology', 'statistics')"
tom,2 classes,"('biology', 'ecology')"
austin,1 class,"('ecology',)"
de,1 class,"('statistics',)"
jackson,1 class,"('biology',)"
jacob,1 class,"('statistics',)"
sam,1 class,"('ecology',)"
Setup and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
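The CSV shown above contains the quoted tuple repr, which is not quite the format asked for in the question. If plain text lines like "ryan, 3 classes, (biology, statistics, ecology)" are wanted instead, one possible sketch is to build the strings directly from the aggregated frame. The names summary and lines are illustrative, and the order of students with the same count may differ from the listing above because the sort does not guarantee an order among ties.

```python
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

# Same aggregation as in the answer: one row per student.
summary = (
    df.melt(var_name='class name', value_name='student name')
      .groupby('student name', as_index=False)
      .agg(class_count=('class name', 'count'),
           classes=('class name', tuple))
      .sort_values('class_count', ascending=False, ignore_index=True)
)

# Build one formatted line per student, joining the class names
# so that no tuple quoting or trailing comma appears.
lines = [
    f"{name}, {count} {'class' if count == 1 else 'classes'}, ({', '.join(classes)})"
    for name, count, classes in summary.itertuples(index=False, name=None)
]

print('\n'.join(lines))
```

Writing these lines to a file with plain Python file handling avoids the extra quoting that to_csv adds around fields containing commas.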
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
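# Map each student name to a total count and the list of classes
results = {}
for h in df:  # h is the column (class) name
    # value_counts gives how often each name occurs within this class
    for k, v in df[h].value_counts().items():
        print(k, v)
        if k in results:
            results[k]['value'] += v
            results[k]['class'].append(h)
        else:
            results[k] = {'value': v, 'class': [h]}

# Rebuild the dict ordered by total count, highest first
results = {h: results[h] for h in sorted(results, key=lambda x: results[x]['value'], reverse=True)}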
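The loop above leaves the answer in a dict and does not print it in the requested format. As a hedged sketch of how that could be finished, the version below builds the same kind of dict (using setdefault instead of the if/else, which is a stylistic swap and not the original author's code) and prints one line per student. The names class_name, student, ordered and label are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

results = {}
for class_name in df:
    for student, n in df[class_name].value_counts().items():
        # Create the entry on first sight, then update it in place.
        entry = results.setdefault(student, {'value': 0, 'class': []})
        entry['value'] += int(n)
        entry['class'].append(class_name)

# Sort (student, info) pairs by total count, highest first.
ordered = sorted(results.items(), key=lambda item: item[1]['value'], reverse=True)

for student, info in ordered:
    label = 'class' if info['value'] == 1 else 'classes'
    print(f"{student}, {info['value']} {label}, ({', '.join(info['class'])})")
```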