How to merge using an outer join in pandas when there is a common column, no common column, or an unknown column
Problem statement: how do I perform an outer join if we don't have a common key (as any additional key may appear on …)
df_a from json_1:
[
{
"bookid": "12345",
"bookname": "who am i"
}
]
df_b from json_2:
[
{
"bookid": "12345",
"bookname": "who am i",
"Author" : "asp"
}
]
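For anyone who wants to reproduce the setup, here is a minimal sketch of building the two frames from the JSON above. The names json_1 and json_2 follow the question; holding the JSON in inline strings rather than reading it from files is an assumption made for illustration.

```python
import json

import pandas as pd

# Inline copies of the two JSON documents from the question (assumed here
# to be strings; in practice they may come from files or an API).
json_1 = '[{"bookid": "12345", "bookname": "who am i"}]'
json_2 = '[{"bookid": "12345", "bookname": "who am i", "Author": "asp"}]'

# Each JSON array of objects becomes one row per object.
df_a = pd.DataFrame(json.loads(json_1))
df_b = pd.DataFrame(json.loads(json_2))

print(df_a.columns.tolist())
print(df_b.columns.tolist())
```

Note that df_b ends up with one more column than df_a, which is exactly the situation the question is about.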
Now I want to find the differences between these two DataFrames for each key and value. (I need to write the output to an HTML table, with each column comparison as a separate DataFrame.)
What I tried:
import numpy as np
import pandas as pd

df1 = pd.merge(df_a[['bookid']], df_b[['bookid']], left_index=True, right_index=True)
df1['diff'] = np.where((df1['bookid_x']==df1['bookid_y']), 'No', 'Yes')
df2 = pd.merge(df_a[['bookname']], df_b[['bookname']], left_index=True, right_index=True)
df2['diff'] = np.where((df2['bookname_x']==df2['bookname_y']), 'No', 'Yes')
# df3 = ?  What should I write here for the unknown Author column coming from df_b?
with open(r"c:\csv\booktest.html", 'w') as _file:
    _file.write(df1.to_html(index=False) + "<br>" + df2.to_html(index=False) + "<br>" + df3.to_html(index=False))
The problem is that the df_b data comes from a different source, so it might have additional columns and values (I don't know beforehand what the column names will be).
Expected output (when I finally compare the two DataFrames, the Author column is new: it comes from df_b and is not present in df_a, so NaN should be printed on the df_a side):
bookid bookid diff
12345 12345 No
bookname bookname diff
who am i who am i No
Author Author diff
NaN asp Yes
One way is to align both DataFrames with .align() so that they have the same columns.
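# align() uses an outer join by default, so a column missing from one frame is added and filled with NaN
_, df_a = df_b.align(df_a, fill_value=np.nan)
_, df_b = df_a.align(df_b, fill_value=np.nan)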
Once you do this, df_a and df_b will have the same columns.
print(df_a)
Author bookid bookname
0 NaN 12345 who am i
print(df_b)
Author bookid bookname
0 asp 12345 who am i
Now you can apply your existing logic to get df3:
df1 = pd.merge(df_a[['bookid']], df_b[['bookid']], left_index=True, right_index=True)
df1['diff'] = np.where((df1['bookid_x']==df1['bookid_y']), 'No', 'Yes')
df2 = pd.merge(df_a[['bookname']], df_b[['bookname']], left_index=True, right_index=True)
df2['diff'] = np.where((df2['bookname_x']==df2['bookname_y']), 'No', 'Yes')
df3 = pd.merge(df_a[['Author']], df_b[['Author']], left_index=True, right_index=True)
df3['diff'] = np.where((df3['Author_x']==df3['Author_y']), 'No', 'Yes')
print(df1)
print(df2)
print(df3)
Result:
bookid_x bookid_y diff
0 12345 12345 No
bookname_x bookname_y diff
0 who am i who am i No
Author_x Author_y diff
0 NaN asp Yes
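One caveat with the `==` comparison used above: NaN never compares equal to anything, including another NaN, so a cell that is missing in both frames would be flagged as 'Yes'. If that is not the behaviour you want, a NaN-aware comparison is one option. The sketch below uses a hypothetical helper, diff_flags, which is not part of pandas or of the original answer.

```python
import numpy as np
import pandas as pd


def diff_flags(left: pd.Series, right: pd.Series) -> np.ndarray:
    """Return 'No' where the values match or are both missing, else 'Yes'.

    Hypothetical helper for illustration; assumes both Series share the
    same index (as they do after .align()).
    """
    same = (left == right) | (left.isna() & right.isna())
    return np.where(same, 'No', 'Yes')


left = pd.Series(['asp', np.nan, np.nan])
right = pd.Series(['asp', np.nan, 'xyz'])
print(diff_flags(left, right))
```

In the example from the question the Author value is missing only on the df_a side, so both approaches give 'Yes' there; the difference only shows up when a value is missing on both sides.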
EDIT:
Of course, you can put the repeated statements into a loop over each column of the DataFrame:
for col in df_b.columns:
    df_temp = pd.merge(df_a[[col]], df_b[[col]], left_index=True, right_index=True)
    df_temp['diff'] = np.where((df_temp[col+'_x'] == df_temp[col+'_y']), 'No', 'Yes')
    print(df_temp)
Or, more efficiently, merge both DataFrames (with all columns) once, then inside a loop over the columns compute the diff for each pair of columns and export it to the HTML file.
df_temp = pd.merge(df_a, df_b, left_index=True, right_index=True)
with open(r"booktest.html", 'w') as _file:
    for col in df_a.columns:
        df_temp[col+'_diff'] = np.where((df_temp[col+'_x'] == df_temp[col+'_y']), 'No', 'Yes')
        _file.write(df_temp[[col + '_x', col + '_y', col + '_diff']].to_html(index=False) + "<br>")
print(df_temp)
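If the default _x / _y suffixes are not descriptive enough, pd.merge accepts a suffixes argument. The sketch below is one possible variation rather than part of the original answer: it rebuilds the two sample frames inline (an assumption, so that the block runs on its own) and labels the two sides _current and _future.

```python
import pandas as pd

# Sample frames matching the question's data (rebuilt here for a standalone run).
df_a = pd.DataFrame({'bookid': ['12345'], 'bookname': ['who am i']})
df_b = pd.DataFrame({'bookid': ['12345'], 'bookname': ['who am i'], 'Author': ['asp']})

# Outer-align so both frames share the same columns (missing ones become NaN).
df_a, df_b = df_a.align(df_b, join='outer')

# Every column overlaps, so every column receives one of the two suffixes.
merged = pd.merge(df_a, df_b, left_index=True, right_index=True,
                  suffixes=('_current', '_future'))
print(merged.columns.tolist())
```

The per-column loop then works as before, with col + '_current' and col + '_future' in place of col + '_x' and col + '_y'.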
You can also do it without .merge, but to get each comparison as a table in the HTML, you will have to initialize a new DataFrame for each column:
with open(r"booktest.html", 'w') as _file:
    for col in df_a.columns:
        df_temp = pd.DataFrame()
        df_temp[col + '_x'], df_temp[col + '_y'], df_temp[col + '_diff'] = df_a[col], df_b[col], np.where((df_a[col] == df_b[col]), 'No', 'Yes')
        _file.write(df_temp.to_html(index=False) + "<br>")
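The same loop can also be wrapped in a function that returns the HTML as a string, which makes it easier to inspect or test before writing it to disk. This is only a sketch: the function name column_diffs_to_html is made up for illustration, and the sample frames are rebuilt inline so the block is self-contained.

```python
import numpy as np
import pandas as pd


def column_diffs_to_html(df_a: pd.DataFrame, df_b: pd.DataFrame) -> str:
    """Build one small HTML table per column, joined by <br>.

    Hypothetical helper; aligns the frames first so that a column present
    in only one of them still gets a table (with NaN on the other side).
    """
    df_a, df_b = df_a.align(df_b, join='outer')
    parts = []
    for col in df_a.columns:
        df_temp = pd.DataFrame({
            col + '_x': df_a[col],
            col + '_y': df_b[col],
            col + '_diff': np.where(df_a[col] == df_b[col], 'No', 'Yes'),
        })
        parts.append(df_temp.to_html(index=False))
    return '<br>'.join(parts)


df_a = pd.DataFrame({'bookid': ['12345'], 'bookname': ['who am i']})
df_b = pd.DataFrame({'bookid': ['12345'], 'bookname': ['who am i'], 'Author': ['asp']})

html = column_diffs_to_html(df_a, df_b)
print(html.count('<table'))  # one table per column
```

Writing the result out is then a single _file.write(html) call.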
EDIT 2:
Fixed the text alignment as per the comment:
text_align = '<style>.dataframe td { text-align: right; }</style>'
with open(r"booktest.html", 'w') as _file:
    for col in df_a.columns:
        df_temp = pd.DataFrame()
        df_temp[col + '_current'], df_temp[col + '_future'], df_temp[col + '_diff'] = df_a[col], df_b[col], np.where((df_a[col] == df_b[col]), 'No', 'Yes')
        _file.write(text_align + df_temp.to_html(index=False) + "<br>")
        print(df_temp)
EDIT 3:
Making a column name blank if all of its values are NaN:
text_align = '<style>.dataframe td { text-align: right; }</style>'
with open(r"booktest.html", 'w') as _file:
    for col in df_a.columns:
        df_temp = pd.DataFrame()
        df_temp[col + '_current'], df_temp[col + '_future'], df_temp[col + '_diff'] = df_a[col], df_b[col], np.where((df_a[col] == df_b[col]), 'No', 'Yes')
        # blank out the name of any column whose values are all NaN
        df_temp = df_temp.rename(columns={c: '' for c in df_temp.columns if df_temp[c].isnull().all()})
        df_temp = df_temp.fillna('')
        # lift the column-width limit (None means no limit) before writing to html so that blank columns are not squeezed
        with pd.option_context('display.max_colwidth', None):
            _file.write(text_align + df_temp.to_html(index=False) + "<br>")