Compare two dataframes and retrieve common row elements
I need to compare two datasets:
DF1
       Subj             1           2           3
0   Biotech  Cell culture     Bioinfo  Immunology
1   Zoology  Cell culture  Immunology         NaN
2      Math  Trigonometry     Algebra         NaN
3  Microbio       Biotech         NaN         NaN
4   Physics        Optics         NaN         NaN
DF2
       Subj             1           2
0   Biotech       Bioinfo  Immunology
1   Zoology    Immunology      Botany
2  Microbio           NaN         NaN
3   Physics        Optics  Quantumphy
4      Math  Trigonometry         NaN
The result dataframe I want:
      Subj             1           2
0  Biotech       Bioinfo  Immunology
1  Zoology    Immunology         NaN
2     Math  Trigonometry         NaN
3  Physics        Optics         NaN
I can't check row by row because the datasets are huge. The number of columns differs between the two datasets, but the number of rows is the same. Since the order of the row elements also varies, I can't simply use merge(). I tried the compare function, but it either removes all common elements or produces a dataframe containing both. I can't seem to pick out just the common elements.
You can match columns and then set the subject column as an index while merging the dataframes:
match = df2.columns.intersection(df1.columns).tolist()
df2.merge(df1, on=match, how='left').reindex(df2.columns, axis=1).set_index('Subj').dropna(how='all')
which returns:
                    1           2
Subj
Biotech       Bioinfo  Immunology
Zoology    Immunology         NaN
Math     Trigonometry         NaN
Physics        Optics         NaN
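Note that merge() compares values under the same column label, so a topic that sits in column 2 of one frame and column 1 of the other is not treated as shared. As a minimal sketch of an order-independent alternative (not the answer's method), both frames can be melted to long format and inner-merged on the (subject, topic) pair. The sample frames are rebuilt from the question, the string column labels are an assumption, and `common_topics` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question (string column labels are an assumption).
df1 = pd.DataFrame({
    "Subj": ["Biotech", "Zoology", "Math", "Microbio", "Physics"],
    "1": ["Cell culture", "Cell culture", "Trigonometry", "Biotech", "Optics"],
    "2": ["Bioinfo", "Immunology", "Algebra", np.nan, np.nan],
    "3": ["Immunology", np.nan, np.nan, np.nan, np.nan],
})
df2 = pd.DataFrame({
    "Subj": ["Biotech", "Zoology", "Microbio", "Physics", "Math"],
    "1": ["Bioinfo", "Immunology", np.nan, "Optics", "Trigonometry"],
    "2": ["Immunology", "Botany", np.nan, "Quantumphy", np.nan],
})


def common_topics(left, right, key="Subj"):
    """Hypothetical helper: per key, keep the topics that occur in both frames."""
    def to_long(frame):
        # Wide -> long: one row per (key, topic), empty cells dropped.
        long = frame.melt(id_vars=key, value_name="topic")
        return long.dropna(subset=["topic"])[[key, "topic"]].drop_duplicates()

    # Inner merge on (key, topic) keeps only the pairs present in both frames.
    both = to_long(left).merge(to_long(right), on=[key, "topic"])
    # Number the topics within each key, then pivot back to wide format.
    both["n"] = both.groupby(key).cumcount() + 1
    return both.pivot(index=key, columns="n", values="topic")


result = common_topics(df1, df2)
```

Subjects with no shared topic (Microbio here) simply do not appear in `result`, because the inner merge produces no rows for them.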
Here is one way to do it.
Understanding: the number of columns varies, and matching values in the two DFs are not necessarily under the same column.
import pandas as pd

# Stack both DFs after setting Subj as the index;
# this changes the wide format to a long format.
# dropna() discards empty cells in case stack() keeps them.
# Then concat the two long DFs to form a new DF.
df3 = pd.concat(
    [df1.set_index('Subj').stack().dropna().reset_index().rename(columns={0: 'val'}),
     df2.set_index('Subj').stack().dropna().reset_index().rename(columns={0: 'val'})],
    ignore_index=True,
)
# A topic that exists under the same subject in both DFs
# appears twice in the concatenated DF,
# so find the duplicated rows for Subj and topic (val column),
# group the duplicated rows and aggregate them into comma-separated values,
# and finally split on the comma to create new columns.
out = (df3[df3.duplicated(subset=['Subj', 'val'])]
       .groupby('Subj')['val']
       .agg(','.join)
       .str
       .split(',', expand=True)
       .reset_index())
out
      Subj             0           1
0  Biotech       Bioinfo  Immunology
1     Math  Trigonometry        None
2  Physics        Optics        None
3  Zoology    Immunology        None
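The join-then-split step assumes that no topic name itself contains a comma. A possible variation, sketched here under that concern, collects the topics into lists and expands the lists into columns, avoiding the round trip through strings. The small `dup` frame is a hypothetical stand-in for the duplicated (Subj, val) rows selected above.

```python
import pandas as pd

# Hypothetical stand-in for the duplicated (Subj, val) rows from the step above.
dup = pd.DataFrame({
    "Subj": ["Biotech", "Biotech", "Zoology", "Physics", "Math"],
    "val": ["Bioinfo", "Immunology", "Immunology", "Optics", "Trigonometry"],
})

# Collect the topics of each subject into a list (groupby sorts the subjects).
lists = dup.groupby("Subj")["val"].agg(list)

# Expand the lists into columns; shorter lists are padded with missing values.
out = pd.DataFrame(lists.tolist(), index=lists.index).reset_index()
```

The resulting columns are labelled 0, 1, ... as in the output above, and a topic such as "Genetics, applied" would stay in a single cell.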