How to merge using an outer join in pandas when there are common, no common, or unknown columns

Question

Problem statement: How do I perform an outer join if we don't have a common key (as any additional key may appear on …)

df_a from json_1:

[
    {
        "bookid": "12345",
        "bookname": "who am i"                 
    }
]

df_b from json_2:

[
    {
        "bookid": "12345",
        "bookname": "who am i",        
        "Author" : "asp"        
    }
]

Now I want to find the differences between these two dataframes for each key and value (I need to write the output to an HTML table, with each column comparison as a separate df).

What I tried:

import numpy as np
import pandas as pd

df1 = pd.merge(df_a[['bookid']], df_b[['bookid']], left_index=True, right_index=True)
df1['diff'] = np.where((df1['bookid_x'] == df1['bookid_y']), 'No', 'Yes')


df2 = pd.merge(df_a[['bookname']],df_b[['bookname']],left_index=True,right_index=True)
df2['diff'] = np.where((df2['bookname_x']==df2['bookname_y']),'No', 'Yes')

# df3 = ?  What should I write here for the unknown Author column coming from df_b?

with open(r"c:\csv\booktest.html", 'w') as _file:     
     _file.write(df1.to_html(index=False) +  "<br>" + df2.to_html(index=False) + "<br>" + df3.to_html(index=False))

The problem is that the df_b data comes from a different source, so it might have additional columns and values (I don't know in advance what the column names will be).

Expected output (when I finally compare the two dfs, Author, for example, is a new column that comes from df_b and is not present in df_a, so NaN should be printed there):

  bookid      bookid       diff
  12345       12345        No

  bookname    bookname     diff
  who am i    who am i     No  
 
  Author      Author       diff
  NaN         asp          Yes

Answer 1

One way is to align both data frames with .align() so that they have the same columns.

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
_, df_a = df_b.align(df_a, fill_value=np.nan)
_, df_b = df_a.align(df_b, fill_value=np.nan)

Once you do this, both df_a and df_b will have the same columns.

print(df_a)
   Author bookid  bookname
0     NaN  12345  who am i

print(df_b)
  Author bookid  bookname
0    asp  12345  who am i
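For anyone who wants to try the alignment step in isolation, here is a minimal sketch. It assumes the two frames are built directly from the sample records in the question, and it uses a single align() call, which should yield the same aligned frames as the two calls above because align() defaults to an outer join on both axes.

```python
import numpy as np
import pandas as pd

# Sample records copied from the question's json_1 and json_2
df_a = pd.DataFrame([{"bookid": "12345", "bookname": "who am i"}])
df_b = pd.DataFrame([{"bookid": "12345", "bookname": "who am i", "Author": "asp"}])

# Outer-align rows and columns; each frame gains the columns it lacked,
# filled with NaN
df_a, df_b = df_a.align(df_b, fill_value=np.nan)

print(list(df_a.columns))
print(df_a["Author"].isna().all())
```

After this call the unknown column exists in both frames, so the per-column comparison no longer needs to know its name in advance.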

Now you can apply the same logic you already have to get df3:

df1 = pd.merge(df_a[['bookid']], df_b[['bookid']], left_index=True, right_index=True)
df1['diff'] = np.where((df1['bookid_x']==df1['bookid_y']), 'No', 'Yes')

df2 = pd.merge(df_a[['bookname']], df_b[['bookname']], left_index=True, right_index=True)
df2['diff'] = np.where((df2['bookname_x']==df2['bookname_y']), 'No', 'Yes')

df3 = pd.merge(df_a[['Author']], df_b[['Author']], left_index=True, right_index=True)
df3['diff'] = np.where((df3['Author_x']==df3['Author_y']), 'No', 'Yes')

print(df1)
print(df2)
print(df3)

Result:

  bookid_x bookid_y diff
0    12345    12345   No
  bookname_x bookname_y diff
0   who am i   who am i   No
   Author_x Author_y diff
0       NaN      asp  Yes

EDIT:

Of course, you can put the common statements into a loop over each column in your df:

for col in df_b.columns:
    df_temp = pd.merge(df_a[[col]], df_b[[col]], left_index=True, right_index=True)
    df_temp['diff'] = np.where((df_temp[col+'_x'] == df_temp[col+'_y']), 'No', 'Yes')
    print(df_temp)
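One caveat with the `==` comparison used in the loop: NaN never compares equal to NaN, so a value that is missing on both sides is flagged as 'Yes'. If that is not what you want, a NaN-aware variant could look like the sketch below. The helper name `diff_flags` and the sample series are made up for illustration.

```python
import numpy as np
import pandas as pd

def diff_flags(left: pd.Series, right: pd.Series) -> np.ndarray:
    """Hypothetical helper: 'No' where values match or are both missing, else 'Yes'."""
    same = (left == right) | (left.isna() & right.isna())
    return np.where(same, "No", "Yes")

# Illustrative data: equal values, both missing, missing on one side only
left = pd.Series(["12345", np.nan, np.nan])
right = pd.Series(["12345", np.nan, "asp"])

flags = diff_flags(left, right)
print(flags)  # ['No' 'No' 'Yes']
```

You could drop `diff_flags(df_a[col], df_b[col])` into the loop in place of the plain `np.where` call.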

Or, more efficiently, you can merge both dfs (on all columns) once, then compute the diff for each pair of columns and export it to the HTML file inside the column loop.

df_temp = pd.merge(df_a, df_b, left_index=True, right_index=True)
with open(r"booktest.html", 'w') as _file:
    for col in df_a.columns:
        df_temp[col+'_diff'] = np.where((df_temp[col+'_x'] == df_temp[col+'_y']), 'No', 'Yes')
        _file.write(df_temp[[col + '_x', col + '_y', col + '_diff']].to_html(index=False) + "<br>")
print(df_temp)
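Note that this merge relies on the frames having been aligned first. As a small sketch of why (using the sample frames from the question, unaligned): merge only adds the _x/_y suffixes to column names that exist on both sides, so an unknown column such as Author would come through without a suffix and the `col + '_x'` lookup would fail.

```python
import pandas as pd

# Unaligned sample frames from the question
df_a = pd.DataFrame([{"bookid": "12345", "bookname": "who am i"}])
df_b = pd.DataFrame([{"bookid": "12345", "bookname": "who am i", "Author": "asp"}])

# Only overlapping column names receive the _x/_y suffixes
merged = pd.merge(df_a, df_b, left_index=True, right_index=True, how="outer")
print(list(merged.columns))
```

Here Author appears as a single, unsuffixed column, which is why the alignment step comes before the merge.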

You can also do it without .merge, but to get each comparison into the HTML as a dataframe, you have to initialize a new dataframe for each column:

with open(r"booktest.html", 'w') as _file:
    for col in df_a.columns:
        df_temp = pd.DataFrame()
        df_temp[col + '_x'], df_temp[col + '_y'], df_temp[col + '_diff'] = df_a[col], df_b[col], np.where((df_a[col] == df_b[col]), 'No', 'Yes')
        _file.write(df_temp.to_html(index=False) + "<br>")
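If you prefer to keep the file handling separate, the same idea can be wrapped in a function that returns the HTML as a string. This is only a sketch under the assumptions above: the function name `compare_to_html` is hypothetical, and the alignment is done inside it so that it also accepts unaligned frames.

```python
import numpy as np
import pandas as pd

def compare_to_html(df_a: pd.DataFrame, df_b: pd.DataFrame) -> str:
    """Hypothetical helper: one HTML table per column, joined with <br>."""
    # Outer-align so that columns unknown to one side exist in both frames
    df_a, df_b = df_a.align(df_b)
    tables = []
    for col in df_a.columns:
        table = pd.DataFrame({
            col + "_x": df_a[col],
            col + "_y": df_b[col],
            col + "_diff": np.where(df_a[col] == df_b[col], "No", "Yes"),
        })
        tables.append(table.to_html(index=False))
    return "<br>".join(tables)

# Sample frames from the question
df_a = pd.DataFrame([{"bookid": "12345", "bookname": "who am i"}])
df_b = pd.DataFrame([{"bookid": "12345", "bookname": "who am i", "Author": "asp"}])

html = compare_to_html(df_a, df_b)
```

Writing the result is then a single `_file.write(html)` inside the `with open(...)` block.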

EDIT 2:

Fixed the text alignment as per the comment:

text_align = '<style>.dataframe td { text-align: right; }</style>'
with open(r"booktest.html", 'w') as _file:
    for col in df_a.columns:
        df_temp = pd.DataFrame()
        df_temp[col + '_current'], df_temp[col + '_future'], df_temp[col + '_diff'] = df_a[col], df_b[col], np.where((df_a[col] == df_b[col]), 'No', 'Yes')
        _file.write(text_align + df_temp.to_html(index=False) + "<br>")
    print(df_temp)

EDIT 3:

Make a column name blank if all of its values are NaN:

text_align = '<style>.dataframe td { text-align: right; }</style>'
with open(r"booktest.html", 'w') as _file:
    for col in df_a.columns:
        df_temp = pd.DataFrame()
        df_temp[col + '_current'], df_temp[col + '_future'], df_temp[col + '_diff'] = df_a[col], df_b[col], np.where((df_a[col] == df_b[col]), 'No', 'Yes')
        # blank out the header of every column whose values are all NaN
        blank = {c: '' for c in df_temp.columns if df_temp[c].isnull().all()}
        df_temp = df_temp.rename(columns=blank).fillna('')
        # lift the column-width limit before writing to html so that blank columns are not squeezed
        # (None replaces -1, which pandas deprecated in 1.0)
        with pd.option_context('display.max_colwidth', None):
            _file.write(text_align + df_temp.to_html(index=False) + "<br>")
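If the goal is only to hide the NaN text in the rendered table, `to_html` also accepts an `na_rep` argument, which leaves the dataframe itself untouched. A minimal sketch, assuming a comparison frame shaped like the Author one above (the frame is built by hand here for illustration):

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the Author comparison frame
df_temp = pd.DataFrame({
    "Author_current": [np.nan],
    "Author_future": ["asp"],
    "Author_diff": ["Yes"],
})

# na_rep controls how missing values are rendered; no fillna() is needed
html = df_temp.to_html(index=False, na_rep="")
```

This does not blank the column header, so the rename step is still needed if you want that.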

Answer 1 (accepted, 2020-06-22 17:59:09)