Compare two dataframes and retrieve common row elements

I need to compare two datasets:

DF1

       Subj             1           2           3
0   Biotech   Cell culture     Bioinfo  Immunology
1   Zoology   Cell culture  Immunology         NaN
2      Math   Trigonometry     Algebra         NaN
3  Microbio        Biotech         NaN         NaN
4   Physics         Optics         NaN         NaN

DF2

       Subj             1           2           
0   Biotech       Bioinfo  Immunology         
1   Zoology    Immunology      Botany                  
2  Microbio         NaN           NaN         
3   Physics        Optics  Quantumphy
4      Math  Trigonometry         NaN         

This is the result dataframe I want:

       Subj             1           2          
0   Biotech       Bioinfo  Immunology         
1   Zoology    Immunology         NaN         
2      Math  Trigonometry         NaN         
3   Physics        Optics         NaN         

I can't check row by row because the datasets are huge. The number of columns differs between the two datasets, but the number of rows is the same. Since the order of the row elements also varies, I can't simply use merge(). I tried the compare function, but it either removes all common elements or produces a dataframe containing both. I can't seem to pick out just the common elements.

You can match columns and then set the subject column as an index while merging the dataframes:

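# Columns present in both dataframes are used as the merge keys
match = df2.columns.intersection(df1.columns).tolist()
# Left-merge df1 onto df2, keep df2's columns, index by Subj, drop rows with no topics
df2.merge(df1, on=match, how='left').reindex(df2.columns, axis=1).set_index('Subj').dropna(how='all')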

which returns:

                    1           2
Subj                             
Biotech       Bioinfo  Immunology
Zoology    Immunology         NaN
Math     Trigonometry         NaN
Physics        Optics         NaN
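
Note that a merge keyed on the shared columns compares values position by position, so a topic that sits under column 1 in one frame and column 2 in the other is not treated as common. A minimal sketch of an order-independent variant follows: reshape both frames to long (Subj, val) pairs, inner-merge on both columns, then pivot back to wide. The data is copied from the question; the string column labels, the helper name `to_long`, and the working column `col` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Data copied from the question; string column labels are an assumption.
df1 = pd.DataFrame({
    "Subj": ["Biotech", "Zoology", "Math", "Microbio", "Physics"],
    "1": ["Cell culture", "Cell culture", "Trigonometry", "Biotech", "Optics"],
    "2": ["Bioinfo", "Immunology", "Algebra", np.nan, np.nan],
    "3": ["Immunology", np.nan, np.nan, np.nan, np.nan],
})
df2 = pd.DataFrame({
    "Subj": ["Biotech", "Zoology", "Microbio", "Physics", "Math"],
    "1": ["Bioinfo", "Immunology", np.nan, "Optics", "Trigonometry"],
    "2": ["Immunology", "Botany", np.nan, "Quantumphy", np.nan],
})


def to_long(df):
    """Hypothetical helper: one (Subj, val) row per topic, no empty cells or repeats."""
    long = df.melt(id_vars="Subj", value_name="val").dropna(subset=["val"])
    return long[["Subj", "val"]].drop_duplicates()


# An inner merge on both columns keeps only the (Subj, val) pairs found in both frames.
common = to_long(df2).merge(to_long(df1), on=["Subj", "val"], how="inner")

# Number the topics within each subject, then pivot back to wide format.
common["col"] = common.groupby("Subj").cumcount() + 1
result = (common.pivot(index="Subj", columns="col", values="val")
                .reindex(df1["Subj"])   # restore df1's subject order
                .dropna(how="all")      # drop subjects with no common topic
                .reset_index())
print(result)
```

Because the comparison happens on (Subj, val) pairs rather than on column positions, the frames may have different numbers of columns and the topics may appear in any order within a row.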

Here is one way to do it.

Understanding: the number of columns varies, and matching values in the two DFs are not necessarily under the same column.

import pandas as pd

# Stack both DFs after setting Subj as the index;
# this changes each one from wide format to long format.
# Then concat the two DFs to form a new DF.

df3 = pd.concat([df1.set_index('Subj').stack().reset_index().rename(columns={0: 'val'}),
                 df2.set_index('Subj').stack().reset_index().rename(columns={0: 'val'})],
                ).reset_index()


# To find the topics that exist under the same subject in both DFs:
# after the concat, a topic shared by both DFs shows up as a duplicated row.

# So find the duplicated rows on Subj and topic (the val column),
# group them by Subj and aggregate the topics into a comma-separated string,
# and finally split on the comma to create new columns.

out=(df3[df3.duplicated(subset=['Subj','val'])]
 .groupby('Subj')['val']
 .agg(','.join)
 .str
 .split(',',expand=True).reset_index())
out
    Subj        0             1
0   Biotech     Bioinfo       Immunology
1   Math        Trigonometry  None
2   Physics     Optics        None
3   Zoology     Immunology    None
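
The join-then-split step assumes that no topic itself contains a comma. Where that cannot be guaranteed, one possible variant is to collect the topics into lists and expand the lists into columns. The sketch below uses a small invented frame named `dups` (with a made-up topic that contains a comma) in place of the duplicated rows selected from `df3`.

```python
import pandas as pd

# Hypothetical long-format frame of common (Subj, val) pairs;
# "Probability, statistics" is an invented topic that contains a comma.
dups = pd.DataFrame({
    "Subj": ["Biotech", "Biotech", "Math"],
    "val": ["Bioinfo", "Immunology", "Probability, statistics"],
})

# Collect each subject's topics into a list instead of a joined string.
topics = dups.groupby("Subj")["val"].agg(list)

# Expand the lists into columns; shorter lists are padded with missing values.
out = pd.DataFrame(topics.tolist(), index=topics.index).reset_index()
print(out)
```

In the pipeline above, `dups` would correspond to `df3[df3.duplicated(subset=['Subj','val'])]`.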
