
How to find the missing values by comparing two data frames

I want to find the values that are missing between two data frames. Here is the code I tried, which works fine:

import pandas as pd


df1 = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["my_column"])
df2 = pd.DataFrame([1, 2, 3], columns=["my_column"])

result = df1[~df1.set_index(list(df1)).index.isin(df2.set_index(list(df2)).index)].dropna()


print(result)

Output:

   my_column
3          4
4          5
5          6

So it works fine on a static dataframe.

But I ran into a problem when I used this code with data loaded from SQL. Here is my full code:

import pyodbc
import pandas as pd
import os
import sqlalchemy as db
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, Date, Float
import datetime as dt

# connect db
engine = create_engine('mssql+pyodbc://xxxxxxxxxx\SMARTRNO_EXPRESS/myDB?driver=SQL+Server+Native+Client+11.0')
connection = engine.connect()


esn_datafeed_query = 'SELECT * FROM [myDB].[dbo].[esn_datafeed]'
esn_inter_intra_query = 'SELECT * FROM [esn_inter_intra_merge]'

esn_datafeed_df = pd.read_sql(esn_datafeed_query ,engine)
esn_inter_intra_merge_df = (esn_inter_intra_query, engine)

df1 = pd.DataFrame(esn_datafeed_df, columns=["st_umts_df_relation_key"])
df2 = pd.DataFrame(esn_inter_intra_merge_df, columns=["st_umts_esn_inter_intra_relation_key"])

result = df1[~df1.set_index(list(df1)).index.isin(df2.set_index(list(df2)).index)].dropna()


print(result)

The previous code shows all the values, which is not what I need. I only want to show the missing values. I tried a different approach, shown in the code below:

esn_datafeed_df = pd.read_sql('SELECT * FROM [myDB].[dbo].[esn_datafeed]', engine)
esn_inter_intra_merge_df = pd.read_sql('SELECT * FROM [myDB].[dbo].[esn_inter_intra_merge]', engine)

df1 = pd.DataFrame(esn_datafeed_df, columns=["st_umts_df_relation_key"])
df2 = pd.DataFrame(esn_inter_intra_merge_df, columns=["st_umts_esn_inter_intra_relation_key"])

merged = df1.merge(df2 , how="left", indicator=True)
result = merged.query("_merge == 'left_only'")[["st_umts_df_relation_key"]]

print(result)

but I got this error:

Traceback (most recent call last):
  File "C:/Users/haroo501/PycharmProjects/tool_check_nbr/my_missing_result.py", line 18, in <module>
    merged = df1.merge(df2 , how="left", indicator=True)
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\frame.py", line 7336, in merge
    return merge(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 68, in merge
    op = _MergeOperation(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 619, in __init__
    self._validate_specification()
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1183, in _validate_specification
    raise MergeError(
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

Edited

I also tried this code:

df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']

but I got this error:

Traceback (most recent call last):
  File "C:/Users/haroo501/PycharmProjects/tool_check_nbr/my_missing_result.py", line 23, in <module>
    df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\frame.py", line 7336, in merge
    return merge(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 68, in merge
    op = _MergeOperation(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 619, in __init__
    self._validate_specification()
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1183, in _validate_specification
    raise MergeError(
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

To briefly explain my database, I have two tables. The first is:

esn_datafeed

and this is the second table, esn_inter_intra_merge:

st_umts_esn_inter_intra_relation_key

So now I want to find the difference between the two tables: I need the values in esn_datafeed.st_umts_df_relation_key that are not in esn_inter_intra_merge.st_umts_esn_inter_intra_relation_key.

Does anyone have any idea how to solve this? Could it be due to the large amount of data in the database?

Is there a way to do this with a query instead?

I think the issue is that your new dataframes use different names for the columns. However, it sounds like you should be using sets anyway. Here is how to get the symmetric difference between the values of two columns.

>>> missing_values = set(df1.iloc[:, 0]).symmetric_difference(set(df2.iloc[:, 0]))
>>> missing_values
{4, 5, 6}

Then you can check if the dataframe values are in these missing values.

>>> df1[df1.iloc[:, 0].isin(missing_values)]
   my_column
3          4
4          5
5          6
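
The differing column names are also the likely reason `merge` raised `MergeError` in the question: with no shared column name and no `on` arguments, pandas has nothing to join on. A minimal sketch of the merge-based approach, assuming small stand-in frames in place of the SQL results, names the key on each side explicitly:

```python
import pandas as pd

# Stand-in data (an assumption); the real frames would come from pd.read_sql.
df1 = pd.DataFrame({"st_umts_df_relation_key": [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({"st_umts_esn_inter_intra_relation_key": [1, 2, 3]})

# The key columns have different names, so state each one explicitly.
merged = df1.merge(
    df2,
    how="left",
    left_on="st_umts_df_relation_key",
    right_on="st_umts_esn_inter_intra_relation_key",
    indicator=True,
)

# Rows flagged 'left_only' are in df1 but have no match in df2.
result = merged.loc[merged["_merge"] == "left_only", ["st_umts_df_relation_key"]]
print(result)
```

Renaming one column so that both frames share a name before merging should work as well; `left_on`/`right_on` simply avoids touching the original column names.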

EDIT

Upon further reflection, isn't this simply a SQL question that has nothing to do with pandas?

Does something like this work? This SQL query selects all records from t1 (esn_datafeed) where there are no corresponding values of st_umts_df_relation_key in the st_umts_esn_inter_intra_relation_key column of t2 (esn_inter_intra_merge).

SELECT * 
FROM esn_datafeed AS t1
LEFT JOIN esn_inter_intra_merge AS t2
ON t1.st_umts_df_relation_key = t2.st_umts_esn_inter_intra_relation_key
WHERE t2.st_umts_esn_inter_intra_relation_key IS NULL
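
To try the query without access to the SQL Server instance, here is a small sketch that runs the same anti-join against an in-memory SQLite database. The table contents are made up for illustration, and the query selects only the key column from t1 (rather than `*`) and adds an `ORDER BY` so the output is deterministic:

```python
import sqlite3

# In-memory database with made-up rows (an assumption, not the real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE esn_datafeed (st_umts_df_relation_key INTEGER);
    CREATE TABLE esn_inter_intra_merge (st_umts_esn_inter_intra_relation_key INTEGER);
    INSERT INTO esn_datafeed VALUES (1), (2), (3), (4), (5), (6);
    INSERT INTO esn_inter_intra_merge VALUES (1), (2), (3);
""")

# Anti-join: keep rows of t1 whose key has no match in t2.
query = """
    SELECT t1.st_umts_df_relation_key
    FROM esn_datafeed AS t1
    LEFT JOIN esn_inter_intra_merge AS t2
      ON t1.st_umts_df_relation_key = t2.st_umts_esn_inter_intra_relation_key
    WHERE t2.st_umts_esn_inter_intra_relation_key IS NULL
    ORDER BY t1.st_umts_df_relation_key
"""
missing = [row[0] for row in conn.execute(query)]
print(missing)
conn.close()
```

Against the real database, the same query string could presumably be passed to `pd.read_sql` together with the engine from the question, which lets the database do the comparison instead of loading both tables into pandas.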
