Pandas dropping rows based on multiple conditions

I have two data frames

A = pd.DataFrame(
    [["abc@gmail.com","4311","3000","STR_1","1384"],
     ["abc@gmail.com","4311","3000","STR_2","1440"  ],
     ["xyz@gmail.com","4311","3000","STR_3","1300"  ],
     ["pqr@gmail.com","4311","3000","STR_3","1300"  ]],
    columns=["EMAIL",   "PRODUCT_ID",   "POST_CODE",    "STORE_NAME",   "STORE_ID"],
)

B = pd.DataFrame(
    [["abc@gmail.com","4311","3000","STR_1","1384"],
     ["xyz@gmail.com","4311","3000","STR_3","1300"  ],],
    columns=["EMAIL",   "PRODUCT_ID",   "POST_CODE",    "STORE_NAME",   "STORE_ID"],
)

Now I need to remove the records from data frame A that have the same EMAIL, PRODUCT_ID, and POST_CODE as a record in data frame B. With the data above, the expected output is the single row for pqr@gmail.com.

I tried using drop_duplicates like this:

pd.concat([A, B]).drop_duplicates(keep=False)

But this compares every column, so it cannot drop rows based only on the chosen columns (EMAIL, PRODUCT_ID, and POST_CODE in this case).

Use the subset parameter to choose the columns that are compared when looking for duplicates:

pd.concat([A, B]).drop_duplicates(subset=["EMAIL",   "PRODUCT_ID",   "POST_CODE"], keep=False)
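
A minimal sketch of how this behaves, using the question's data. One caveat seems worth checking before relying on it: with keep=False every row whose key occurs more than once in the concatenated frame is dropped, so rows that exist only in B survive, and rows that are duplicated only within A are removed. The rows for dup@gmail.com and only_in_b@gmail.com below are hypothetical and were added purely to illustrate that.

```python
import pandas as pd

cols = ["EMAIL", "PRODUCT_ID", "POST_CODE", "STORE_NAME", "STORE_ID"]
keys = ["EMAIL", "PRODUCT_ID", "POST_CODE"]

A = pd.DataFrame(
    [["abc@gmail.com", "4311", "3000", "STR_1", "1384"],
     ["abc@gmail.com", "4311", "3000", "STR_2", "1440"],
     ["xyz@gmail.com", "4311", "3000", "STR_3", "1300"],
     ["pqr@gmail.com", "4311", "3000", "STR_3", "1300"]],
    columns=cols,
)
B = pd.DataFrame(
    [["abc@gmail.com", "4311", "3000", "STR_1", "1384"],
     ["xyz@gmail.com", "4311", "3000", "STR_3", "1300"]],
    columns=cols,
)

# On the question's data only the pqr@gmail.com row is left.
result = pd.concat([A, B]).drop_duplicates(subset=keys, keep=False)

# Hypothetical extra rows (not part of the question) to show the caveat.
extra_a = pd.DataFrame(
    [["dup@gmail.com", "1", "1", "S", "1"],
     ["dup@gmail.com", "1", "1", "T", "2"]],
    columns=cols,
)
extra_b = pd.DataFrame(
    [["only_in_b@gmail.com", "1", "1", "S", "1"]],
    columns=cols,
)
A2 = pd.concat([A, extra_a], ignore_index=True)
B2 = pd.concat([B, extra_b], ignore_index=True)

# dup@gmail.com disappears although B does not contain it, and
# only_in_b@gmail.com shows up although it was never in A.
caveat = pd.concat([A2, B2]).drop_duplicates(subset=keys, keep=False)
```

If B can contain keys that A lacks, or A can repeat keys that B lacks, a filter that looks only at A's rows is probably the safer choice.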

This solution relies on two pandas features:

  1. the set_index() method.
  2. the isin() method.

First, set the index of both data frames to "EMAIL", "PRODUCT_ID", "POST_CODE". Then use isin() on the resulting indexes to keep only the rows of A whose key combination does not appear in B.

The code:

import pandas as pd

A = pd.DataFrame(
    [["abc@gmail.com","4311","3000","STR_1","1384"],
     ["abc@gmail.com","4311","3000","STR_2","1440"  ],
     ["xyz@gmail.com","4311","3000","STR_3","1300"  ],
     ["pqr@gmail.com","4311","3000","STR_3","1300"  ]],
    columns=["EMAIL",   "PRODUCT_ID",   "POST_CODE",    "STORE_NAME",   "STORE_ID"],
)

B = pd.DataFrame(
    [["abc@gmail.com","4311","3000","STR_1","1384"],
     ["xyz@gmail.com","4311","3000","STR_3","1300"  ],],
    columns=["EMAIL",   "PRODUCT_ID",   "POST_CODE",    "STORE_NAME",   "STORE_ID"],
)

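# Build a MultiIndex from the key columns of each data frame.
i1 = A.set_index(["EMAIL", "PRODUCT_ID", "POST_CODE"]).index
i2 = B.set_index(["EMAIL", "PRODUCT_ID", "POST_CODE"]).index
# Keep the rows of A whose key combination does not appear in B.
result = A[~i1.isin(i2)]
print(result)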

Output:

           EMAIL PRODUCT_ID POST_CODE STORE_NAME STORE_ID
3  pqr@gmail.com       4311      3000      STR_3     1300
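
For comparison, here is a sketch of the same filter written as an anti-join with merge() and its indicator flag. This is not taken from either answer above; it is a different technique that should give the same rows on this data. The names keys and marked are only illustrative.

```python
import pandas as pd

cols = ["EMAIL", "PRODUCT_ID", "POST_CODE", "STORE_NAME", "STORE_ID"]
keys = ["EMAIL", "PRODUCT_ID", "POST_CODE"]

A = pd.DataFrame(
    [["abc@gmail.com", "4311", "3000", "STR_1", "1384"],
     ["abc@gmail.com", "4311", "3000", "STR_2", "1440"],
     ["xyz@gmail.com", "4311", "3000", "STR_3", "1300"],
     ["pqr@gmail.com", "4311", "3000", "STR_3", "1300"]],
    columns=cols,
)
B = pd.DataFrame(
    [["abc@gmail.com", "4311", "3000", "STR_1", "1384"],
     ["xyz@gmail.com", "4311", "3000", "STR_3", "1300"]],
    columns=cols,
)

# Left-join A against the distinct key combinations of B. The "_merge"
# column added by indicator=True records whether each row of A matched.
marked = A.merge(B[keys].drop_duplicates(), on=keys, how="left", indicator=True)

# Keep the rows that found no match in B, restricted to A's columns.
result = marked.loc[marked["_merge"] == "left_only", cols]
```

Note that merge() returns a frame with a fresh integer index, so A's original index labels are not carried over in general; the isin() version keeps them.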
