Uncommon values from two data frames by rows in python

I have two dataframes, df1 and df2. The first column in both is a customer ID, which is an int, but the other columns contain various string values. I want to produce a new dataframe df3 that contains, for each customer ID, the set of values found in df2 but not in df1.

Example:

df1:

     v1 v2 v3 v4
cust            
1     A  B  B  A
2     A  A  A  A
3     B  B  A  A
4     B  C  A  A

df2:

     v1 v2 v3 v4
cust            
1     A  A  C  B
2     A  A  C  B
3     C  B  B  A
4     C  B  B  A

Expected output:

cust
1       {C}
2    {B, C}
3       {C}
4        {}

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: df_2 = pd.DataFrame({"KundelID" : list(range(1,11)),
   ...:               'V1' : list('AACCBBBCCC'),
   ...:               'V2' : list('AABBBCCCAA'),
   ...:               'V3' : list('CCBBBBBAAB'),
   ...:               'V4' : list('BBAACAAAAB')})
   ...: df_1 = pd.DataFrame({"KundelID" : list(range(1,11)),
   ...:               'V1' : list('AABBCCCCCC'),
   ...:               'V2' : list('BABCCCCAAA'),
   ...:               'V3' : list('BAAAAABBBB'),
   ...:               'V4' : list('AAAACCCCBB')})

In [3]: df_1
Out[3]: 
   KundelID V1 V2 V3 V4
0         1  A  B  B  A
1         2  A  A  A  A
2         3  B  B  A  A
3         4  B  C  A  A
4         5  C  C  A  C
5         6  C  C  A  C
6         7  C  C  B  C
7         8  C  A  B  C
8         9  C  A  B  B
9        10  C  A  B  B

In [4]: df_2
Out[4]: 
   KundelID V1 V2 V3 V4
0         1  A  A  C  B
1         2  A  A  C  B
2         3  C  B  B  A
3         4  C  B  B  A
4         5  B  B  B  C
5         6  B  C  B  A
6         7  B  C  B  A
7         8  C  C  A  A
8         9  C  A  A  A
9        10  C  A  B  B

In [7]: pd.DataFrame({"KundeID" : df_2.KundelID,
   ...:             'Not-in-df_1' : [','.join([i for i in df_2_ if i not in df_1_])
   ...:                              if [i for i in df_2_ if i not in df_1_] else None
   ...:                              for df_1_, df_2_ in zip(df_1.T[1:].apply(np.unique),
   ...:                                                      df_2.T[1:].apply(np.unique))]})
Out[7]: 
   KundeID Not-in-df_1
0        1           C
1        2         B,C
2        3           C
3        4        None
4        5           B
5        6           B
6        7           A
7        8        None
8        9        None
9       10        None
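
The one-liner above is dense, so here is a rough sketch of the same idea unpacked into a small helper. It uses only the first four customers from the data above, the name not_in_first is a hypothetical helper introduced for illustration, and it assumes both frames list the customers in the same row order:

```python
import pandas as pd

# First four customers from df_1 / df_2 above.
df_1 = pd.DataFrame({"KundelID": [1, 2, 3, 4],
                     "V1": list("AABB"),
                     "V2": list("BABC"),
                     "V3": list("BAAA"),
                     "V4": list("AAAA")})
df_2 = pd.DataFrame({"KundelID": [1, 2, 3, 4],
                     "V1": list("AACC"),
                     "V2": list("AABB"),
                     "V3": list("CCBB"),
                     "V4": list("BBAA")})


def not_in_first(row_1, row_2):
    """Hypothetical helper: values of row_2 absent from row_1, comma-joined."""
    extra = sorted(set(row_2) - set(row_1))
    return ",".join(extra) if extra else None


value_cols = ["V1", "V2", "V3", "V4"]

# Assumption: rows of df_1 and df_2 are aligned by position.
result = pd.DataFrame({
    "KundelID": df_2["KundelID"],
    "Not-in-df_1": [
        not_in_first(r1, r2)
        for r1, r2 in zip(df_1[value_cols].to_numpy(),
                          df_2[value_cols].to_numpy())
    ],
})
print(result)
```

Sorting the difference before joining keeps the string stable (B,C rather than C,B) regardless of set iteration order.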


The idea is to transform all the values in each row into a set. Then we can take the set difference for each customer ID. This avoids explicit loops and list comprehensions:

df3 = (
    pd
    .concat([
        df1.reindex(index=df2.index).apply(set, axis=1),
        df2.apply(set, axis=1),
    ], axis=1)
    .apply(lambda r: r[1].difference(r[0]), axis=1)
)
print(df3)
# Out:
cust
1       {C}
2    {B, C}
3       {C}
4        {}

Notes:

  1. The df1.reindex(index=df2.index) bit is there in case the two frames do not contain exactly the same IDs: the result then follows the index of df2.
  2. It is trivial to turn the output into something other than a set. For example, using ','.join(r[1].difference(r[0])) as the lambda body produces strings.
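
Both notes can be sketched together. The frames below are made up for illustration (they are not the question's data), with customer 4 deliberately missing from df1, and the difference is sorted before joining so that the strings are deterministic:

```python
import pandas as pd

# Hypothetical data: customer 4 exists in df2 only.
df1 = pd.DataFrame({"v1": list("AAB"), "v2": list("BAB")},
                   index=pd.Index([1, 2, 3], name="cust"))
df2 = pd.DataFrame({"v1": list("AACC"), "v2": list("ACBB")},
                   index=pd.Index([1, 2, 3, 4], name="cust"))

df3 = (
    pd
    .concat([
        # reindex adds an all-NaN row for customer 4, so its set is {nan}
        df1.reindex(index=df2.index).apply(set, axis=1),
        df2.apply(set, axis=1),
    ], axis=1)
    # join the sorted difference into a string instead of returning a set
    .apply(lambda r: ",".join(sorted(r[1].difference(r[0]))), axis=1)
)
print(df3)
```

For customer 4, every value in df2 is reported, since nothing from df1 can be subtracted. Customers with no new values get an empty string.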

Setup:

For future reference, in order to facilitate a reproducible example, it is a good idea to provide some code that can directly be copy/pasted by SO-ers for a quick start into your problem.

import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""
1 A B B A
2 A A A A
3 B B A A
4 B C A A
"""), sep=' ', names='cust v1 v2 v3 v4'.split()).set_index('cust')

df2 = pd.read_csv(io.StringIO("""
1 A A C B
2 A A C B
3 C B B A
4 C B B A
"""), sep=' ', names='cust v1 v2 v3 v4'.split()).set_index('cust')

You can transform each dataframe into a Series of sets, then perform the set operation across the two Series, leveraging the intrinsic index alignment of pandas Series:

df2.apply(set, axis=1) - df1.apply(set, axis=1)

Output:

cust
1       {C}
2    {C, B}
3       {C}
4        {}
dtype: object
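
Note that the element order inside each printed set is arbitrary ({C, B} here, {B, C} in the other answer). If a stable order matters, one option is to map each set to a sorted list. A minimal self-contained sketch using the question's data (the variable names diff and stable are just illustrative):

```python
import io
import pandas as pd

cols = "cust v1 v2 v3 v4".split()
df1 = pd.read_csv(io.StringIO("1 A B B A\n2 A A A A\n3 B B A A\n4 B C A A\n"),
                  sep=" ", names=cols).set_index("cust")
df2 = pd.read_csv(io.StringIO("1 A A C B\n2 A A C B\n3 C B B A\n4 C B B A\n"),
                  sep=" ", names=cols).set_index("cust")

# Element-wise set difference, aligned on the cust index.
diff = df2.apply(set, axis=1) - df1.apply(set, axis=1)

# Sets have no defined order; sorted lists print the same way every time.
stable = diff.map(sorted)
print(stable)
```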

If you want the symmetric difference across datasets (i.e. elements in either set but not in both), then it's better to use pd.concat:

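dfs = [df1, df2]
pd.concat([df.apply(set, axis=1) for df in dfs], axis=1).apply(lambda x: x[0] ^ x[1], axis=1)

Here axis=1 is spelled out because recent pandas versions require pd.concat to receive axis as a keyword argument. Also, replacing x[0] ^ x[1] with set.symmetric_difference(*x) should work as well.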

Interestingly, Series_A ^ Series_B doesn't work as expected. Instead, it (apparently) returns a bool Series indicating whether the result of each set operation is non-empty.
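
To see how the symmetric difference differs from the one-sided difference, here is a small sketch on made-up data (not the question's frames, where both happen to give the same result). The results are joined into sorted strings only to make the output deterministic:

```python
import pandas as pd

# Hypothetical data chosen so that each frame has values the other lacks.
idx = pd.Index([1, 2], name="cust")
df1 = pd.DataFrame({"v1": ["A", "A"], "v2": ["B", "D"]}, index=idx)
df2 = pd.DataFrame({"v1": ["A", "A"], "v2": ["C", "A"]}, index=idx)

# One column of sets per frame; columns are labelled 0 and 1.
sets = pd.concat([df.apply(set, axis=1) for df in (df1, df2)], axis=1)

# Values that appear in exactly one of the two frames for each customer.
sym = sets.apply(lambda x: ",".join(sorted(x[0] ^ x[1])), axis=1)

# Values in df2 but not in df1, for comparison.
one_sided = sets.apply(lambda x: ",".join(sorted(x[1] - x[0])), axis=1)

print(sym)
print(one_sided)
```

For customer 2, the symmetric difference reports D (present only in df1), while the one-sided difference is empty.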
