How to find the missing values by comparing two data frames
I just want to find the values that are missing between two data frames. Here is the code I tried, and it works fine:
import pandas as pd
df1 = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["my_column"])
df2 = pd.DataFrame([1, 2, 3], columns=["my_column"])
result = df1[~df1.set_index(list(df1)).index.isin(df2.set_index(list(df2)).index)].dropna()
print(result)
Output:
   my_column
3          4
4          5
5          6
So it works fine on static data frames.
But when I use this code on data loaded from SQL, I run into a problem. Here is my full code:
import pyodbc
import pandas as pd
import os
import sqlalchemy as db
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, Date, Float
import datetime as dt
# connect db
engine = create_engine(r'mssql+pyodbc://xxxxxxxxxx\SMARTRNO_EXPRESS/myDB?driver=SQL+Server+Native+Client+11.0')
connection = engine.connect()
esn_datafeed_query = 'SELECT * FROM [myDB].[dbo].[esn_datafeed]'
esn_inter_intra_query = 'SELECT * FROM [esn_inter_intra_merge]'
esn_datafeed_df = pd.read_sql(esn_datafeed_query ,engine)
esn_inter_intra_merge_df = (esn_inter_intra_query, engine)
df1 = pd.DataFrame(esn_datafeed_df, columns=["st_umts_df_relation_key"])
df2 = pd.DataFrame(esn_inter_intra_merge_df, columns=["st_umts_esn_inter_intra_relation_key"])
result = df1[~df1.set_index(list(df1)).index.isin(df2.set_index(list(df2)).index)].dropna()
print(result)
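The code above displays all the values, which is not what I need. I only want to display the missing values. So I tried a different approach with the code below: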
esn_datafeed_df = pd.read_sql('SELECT * FROM [myDB].[dbo].[esn_datafeed]', engine)
esn_inter_intra_merge_df = pd.read_sql('SELECT * FROM [myDB].[dbo].[esn_inter_intra_merge]', engine)
df1 = pd.DataFrame(esn_datafeed_df, columns=["st_umts_df_relation_key"])
df2 = pd.DataFrame(esn_inter_intra_merge_df, columns=["st_umts_esn_inter_intra_relation_key"])
merged = df1.merge(df2 , how="left", indicator=True)
result = merged.query("_merge == 'left_only'")[["st_umts_df_relation_key"]]
print(result)
But I got this error:
Traceback (most recent call last):
File "C:/Users/haroo501/PycharmProjects/tool_check_nbr/my_missing_result.py", line 18, in <module>
merged = df1.merge(df2 , how="left", indicator=True)
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\frame.py", line 7336, in merge
return merge(
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 68, in merge
op = _MergeOperation(
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 619, in __init__
self._validate_specification()
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1183, in _validate_specification
raise MergeError(
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
I also tried this code:
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
But I got this error:
Traceback (most recent call last):
File "C:/Users/haroo501/PycharmProjects/tool_check_nbr/my_missing_result.py", line 23, in <module>
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\frame.py", line 7336, in merge
return merge(
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 68, in merge
op = _MergeOperation(
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 619, in __init__
self._validate_specification()
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1183, in _validate_specification
raise MergeError(
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
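To briefly explain what is in my database: I have two tables.
This is the second table, esn_inter_intra_merge.
I want to work out the difference between these two tables: I need the values of esn_datafeed.st_umts_df_relation_key that are not in esn_inter_intra_merge.st_umts_esn_inter_intra_relation_key.
Does anyone know how to solve this? Could it be caused by the large amount of data in the database?
Is there a way to handle this in the query itself so that it works?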
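I think the problem is that your new data frames use different names for their columns. That said, it sounds like you should be using sets anyway. Here is how to get the symmetric difference between the values of the two columns: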
missing_values = set(df1.iloc[:, 0]).symmetric_difference(set(df2.iloc[:, 0]))
>>> missing_values
{4, 5, 6}
You can then check which data frame values are among these missing values:
>>> df1[df1.iloc[:, 0].isin(missing_values)]
   my_column
3          4
4          5
5          6
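The differing column names would also explain the `MergeError` in the question: with no shared column name, `merge` has nothing to join on unless the keys are named explicitly. Below is a minimal sketch of a left anti-join using `left_on` and `right_on`. The toy values (including the extra `7` in `df2`) are made up for illustration, and it assumes the two key columns hold comparable values of the same type.

```python
import pandas as pd

# Toy stand-ins for the two SQL tables; the values are made up for illustration.
df1 = pd.DataFrame({"st_umts_df_relation_key": [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({"st_umts_esn_inter_intra_relation_key": [1, 2, 3, 7]})

# Name the key on each side explicitly, since the frames share no column name.
# drop_duplicates() keeps repeated keys in df2 from multiplying rows of df1.
merged = df1.merge(
    df2.drop_duplicates(),
    how="left",
    left_on="st_umts_df_relation_key",
    right_on="st_umts_esn_inter_intra_relation_key",
    indicator=True,
)

# Rows flagged 'left_only' exist in df1 but have no match in df2.
missing = merged.loc[merged["_merge"] == "left_only", ["st_umts_df_relation_key"]]
print(missing)

# The same rows without a merge, using isin on the two key columns directly.
in_df2 = df1["st_umts_df_relation_key"].isin(df2["st_umts_esn_inter_intra_relation_key"])
missing_isin = df1[~in_df2]
print(missing_isin)
```

Unlike the symmetric difference, both variants return only the values that are in `df1` and absent from `df2`, so the `7` that exists only in `df2` is not reported. It may also be worth checking the first SQL snippet in the question, where `esn_inter_intra_merge_df` is assigned the tuple `(esn_inter_intra_query, engine)` rather than the result of `pd.read_sql`.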
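Edit
On further thought, isn't this just an SQL problem that has nothing to do with pandas?
Would something like this work? This SQL query selects all records from t1 (esn_datafeed) for which the st_umts_esn_inter_intra_relation_key column of t2 (esn_inter_intra_merge) has no corresponding st_umts_df_relation_key value.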
SELECT *
FROM esn_datafeed AS t1
LEFT JOIN esn_inter_intra_merge AS t2
ON t1.st_umts_df_relation_key = t2.st_umts_esn_inter_intra_relation_key
WHERE t2.st_umts_esn_inter_intra_relation_key IS NULL
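If the result is still needed as a data frame, a query like this can be passed to `pd.read_sql` so that the database does the filtering and only the missing rows are transferred. The sketch below uses an in-memory SQLite database with made-up rows as a stand-in for the SQL Server instance, and selects only the key column (with an `ORDER BY` for a stable result); in the real setup the `engine` from the question would presumably take the place of `conn`.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the SQL Server database; the rows are made up.
conn = sqlite3.connect(":memory:")
pd.DataFrame({"st_umts_df_relation_key": ["a", "b", "c", "d"]}).to_sql(
    "esn_datafeed", conn, index=False
)
pd.DataFrame({"st_umts_esn_inter_intra_relation_key": ["a", "b"]}).to_sql(
    "esn_inter_intra_merge", conn, index=False
)

# Left anti-join: keep rows of esn_datafeed with no matching key in the other table.
anti_join_query = """
SELECT t1.st_umts_df_relation_key
FROM esn_datafeed AS t1
LEFT JOIN esn_inter_intra_merge AS t2
    ON t1.st_umts_df_relation_key = t2.st_umts_esn_inter_intra_relation_key
WHERE t2.st_umts_esn_inter_intra_relation_key IS NULL
ORDER BY t1.st_umts_df_relation_key
"""

missing_df = pd.read_sql(anti_join_query, conn)
print(missing_df)
conn.close()
```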