using isin across multiple columns

I'm trying to use .isin with the ~ so I can get a list of unique rows back based on multiple columns in 2 data-sets.

So, I have 2 data sets with 9 rows each. df2 is the top table and df1 is the bottom one (sorry, I couldn't get both to display properly below).

   Index  Serial  Count  Churn
       1       9      5      0
       2       8      6      0
       3      10      2      1
       4       7      4      2
       5       7      9      2
       6      10      2      2
       7       2      9      1
       8       9      8      3
       9       4      3      5


   Index  Serial  Count  Churn
       1      10      2      1
       2      10      2      1
       3       9      3      0
       4       8      6      0
       5       9      8      0
       6       1      9      1
       7      10      3      1
       8       6      7      1
       9       4      8      0

I would like to get a list of rows from df1 which aren't in df2 based on more than 1 column.

For example, if I base my search on the columns Serial and Count, I wouldn't get Index 1 and 2 back from df1, as that combination (10, 2) appears in df2 at Index 6. The same goes for Index 4 in df1, which appears at Index 2 in df2, and for Index 5 in df1, which appears at Index 8 in df2.

The churn column doesn't really matter.

I can get it to work based on 1 column, but not on more than 1.

df2[~df2.Serial.isin(df1.Serial.values)] kind of does what I want, but only on 1 column. I want it to be based on 2 or more, so that I get back:

   Index  Serial  Count  Churn
       3       9      3      0
       6       1      9      1
       7      10      3      1
       8       6      7      1
       9       4      8      0

One solution is to merge with indicators:

import pandas as pd

# Sample data; the values differ slightly from the tables in the question
df1 = pd.DataFrame([[10, 2, 0], [9, 4, 1], [9, 8, 1], [8, 6, 1], [9, 8, 1], [1, 9, 1], [10, 3, 1], [6, 7, 1], [4, 8, 1]], columns=['Serial', 'Count', 'Churn'])
df2 = pd.DataFrame([[9, 5, 1], [8, 6, 1], [10, 2, 1], [7, 4, 1], [7, 9, 1], [10, 2, 1], [2, 9, 1], [9, 8, 1], [4, 3, 1]], columns=['Serial', 'Count', 'Churn'])
# Left-merge on the key columns with the indicator on; dropping duplicate keys from df2 keeps df1 rows from being repeated
df_temp = df1.merge(df2[['Serial', 'Count']].drop_duplicates(), on=['Serial', 'Count'], how='left', indicator=True)
# Rows marked 'left_only' have no matching (Serial, Count) pair in df2
res = df_temp.loc[df_temp['_merge'] == 'left_only'].drop('_merge', axis=1)

Output        
   Serial  Count  Churn
1       9      4      1
5       1      9      1
6      10      3      1
7       6      7      1
8       4      8      1
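
As a rough sketch, the same merge can be wrapped in a small helper and run against the tables from the question. The helper name rows_not_in is made up for illustration, and the frames are typed in by hand from the question (assuming the bottom table is df1 and the top one is df2), with Index kept as an ordinary column so it survives the merge.

```python
import pandas as pd

# Tables transcribed from the question (assumption: bottom table is df1, top is df2).
df1 = pd.DataFrame({
    "Index":  [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Serial": [10, 10, 9, 8, 9, 1, 10, 6, 4],
    "Count":  [2, 2, 3, 6, 8, 9, 3, 7, 8],
    "Churn":  [1, 1, 0, 0, 0, 1, 1, 1, 0],
})
df2 = pd.DataFrame({
    "Index":  [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Serial": [9, 8, 10, 7, 7, 10, 2, 9, 4],
    "Count":  [5, 6, 2, 4, 9, 2, 9, 8, 3],
    "Churn":  [0, 0, 1, 2, 2, 2, 1, 3, 5],
})


def rows_not_in(left, right, cols):
    """Return rows of `left` whose `cols` combination never occurs in `right`."""
    # Only the key columns of `right` are needed; dropping duplicates
    # ensures the left join cannot repeat any row of `left`.
    keys = right[cols].drop_duplicates()
    merged = left.merge(keys, on=cols, how="left", indicator=True)
    # 'left_only' marks rows that found no partner in `right`.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


result = rows_not_in(df1, df2, ["Serial", "Count"])
print(result)
```

With these inputs the rows kept are the ones with Index 3, 6, 7, 8 and 9, which matches what the question asks for.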

I had a similar problem to solve, and the easiest way I found was to create a temporary column that combines the identifier columns, then use isin on the values of that temporary column.

A simple function that does this could look like the following:

from functools import reduce


def get_temp_col(df, cols, sep="|"):
    """Build one string key per row by joining the values of `cols`."""
    # The separator keeps column boundaries visible, so that (1, 23) and
    # (12, 3) do not both become "123". Pick one that cannot occur in the data.
    first, *rest = cols
    return reduce(lambda key, col: key + sep + df[col].astype(str), rest, df[first].astype(str))


def subset_on_x_columns(df1, df2, cols):
    """
    Return the rows of `df1` whose combination of values in `cols`
    does not occur in `df2`.

    :param df1: pandas DataFrame to subset
    :param df2: pandas DataFrame whose `cols` combinations are excluded from `df1`
    :param cols: list of column names to use as identifiers
    """
    df1_temp_col = get_temp_col(df1, cols)
    df2_temp_col = get_temp_col(df2, cols)

    return df1[~df1_temp_col.isin(df2_temp_col.unique())]

So for your case, all you need to run is:

result_df = subset_on_x_columns(df1, df2, ['Serial', 'Count'])

which has the wanted rows:

   Index  Serial  Count  Churn
      3       9      3      0
      6       1      9      1
      7      10      3      1
      8       6      7      1
      9       4      8      0
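
A caveat with concatenated string keys in general: if the values are joined with nothing in between, different combinations can collapse into the same string. The tiny frames below are invented purely to show this, and naive_key and safe_key are illustrative names rather than anything from the answer.

```python
import pandas as pd

# Made-up data: (1, 23) and (12, 3) are different (Serial, Count) pairs.
a = pd.DataFrame({"Serial": [1], "Count": [23]})
b = pd.DataFrame({"Serial": [12], "Count": [3]})


def naive_key(df):
    # Concatenate the two columns as strings with nothing in between.
    return df["Serial"].astype(str) + df["Count"].astype(str)


def safe_key(df, sep="|"):
    # Same idea, but a separator keeps the column boundary visible.
    return df["Serial"].astype(str) + sep + df["Count"].astype(str)


naive_hit = naive_key(a).isin(naive_key(b)).tolist()
safe_hit = safe_key(a).isin(safe_key(b)).tolist()
print(naive_hit)  # [True]: both keys are "123", a false match
print(safe_hit)   # [False]: "1|23" differs from "12|3"
```

A separator only helps if it cannot appear inside the values themselves, which is a safe assumption for numeric columns like these.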

The nice thing about this solution is that it scales naturally with the number of columns, i.e. all you need to do is list the identifier columns in the cols parameter.
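
If building string keys feels fragile, one alternative that takes a list of columns in the same way is to compare the value tuples directly through MultiIndex.isin. This is only a sketch and not part of the answers above; rows_missing_from is a made-up name and the small frames are illustrative.

```python
import pandas as pd


def rows_missing_from(df1, df2, cols):
    """Rows of df1 whose values in `cols` do not occur together in df2."""
    # Turn the key columns of each frame into a MultiIndex of value tuples.
    left_keys = pd.MultiIndex.from_frame(df1[cols])
    right_keys = pd.MultiIndex.from_frame(df2[cols])
    # isin compares whole tuples, so no string conversion is involved.
    mask = left_keys.isin(right_keys)
    return df1[~mask]


# Illustrative data, not taken from the question.
df1 = pd.DataFrame({"Serial": [1, 10, 9], "Count": [23, 2, 3], "Churn": [0, 1, 0]})
df2 = pd.DataFrame({"Serial": [12, 10], "Count": [3, 2], "Churn": [1, 1]})

result = rows_missing_from(df1, df2, ["Serial", "Count"])
print(result)
```

Because the boolean mask is applied to df1 itself, the original index labels and all other columns are kept as they are.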
