I have two dataframes, df1 and df2. The first column in both is a customer ID, which is an int, but the other columns contain various string values. I want to produce a new dataframe df3 that contains, for each customer ID, the set of values found in df2 but not in df1.
Example:
df1:
v1 v2 v3 v4
cust
1 A B B A
2 A A A A
3 B B A A
4 B C A A
df2:
v1 v2 v3 v4
cust
1 A A C B
2 A A C B
3 C B B A
4 C B B A
Expected output:
cust
1 {C}
2 {B, C}
3 {C}
4 {}
In [1]: import numpy as np
...: import pandas as pd
In [2]: df_2 = pd.DataFrame({"KundelID" : list(range(1,11)),
...: 'V1' : list('AACCBBBCCC'),
...: 'V2' : list('AABBBCCCAA'),
...: 'V3' : list('CCBBBBBAAB'),
...: 'V4' : list('BBAACAAAAB')})
...: df_1 = pd.DataFrame({"KundelID" : list(range(1,11)),
...: 'V1' : list('AABBCCCCCC'),
...: 'V2' : list('BABCCCCAAA'),
...: 'V3' : list('BAAAAABBBB'),
...: 'V4' : list('AAAACCCCBB')})
In [3]: df_1
Out[3]:
KundelID V1 V2 V3 V4
0 1 A B B A
1 2 A A A A
2 3 B B A A
3 4 B C A A
4 5 C C A C
5 6 C C A C
6 7 C C B C
7 8 C A B C
8 9 C A B B
9 10 C A B B
In [4]: df_2
Out[4]:
KundelID V1 V2 V3 V4
0 1 A A C B
1 2 A A C B
2 3 C B B A
3 4 C B B A
4 5 B B B C
5 6 B C B A
6 7 B C B A
7 8 C C A A
8 9 C A A A
9 10 C A B B
In [7]: pd.DataFrame({"KundeID" : df_2.KundelID,
...: 'Not-in-df_1' : [','.join([i for i in df_2_ if not i in df_1_]) if [i for i in df_2_ if not i in df_1_] else None for df_1_,df_2_ in zip(df_1.T[1:].apply(np.unique), df_2.T[1:].apply(np.unique))]})
Out[7]:
KundeID Not-in-df_1
0 1 C
1 2 B,C
2 3 C
3 4 None
4 5 B
5 6 B
6 7 A
7 8 None
8 9 None
9 10 None
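The one-liner above packs the whole computation into a single comprehension. As a minimal sketch of the same row-by-row idea in a more readable form, the helper below (the name values_only_in_second and its parameters are hypothetical, not part of the original answer) builds one set per row and subtracts them. It assumes both frames list the same IDs in the same row order, and it uses only the first four customers from the example.

```python
import pandas as pd


def values_only_in_second(first: pd.DataFrame, second: pd.DataFrame, id_col: str) -> pd.Series:
    """For each row, return the values present in `second` but not in `first`.

    Assumption: both frames hold the same IDs in the same row order.
    """
    value_cols = [c for c in second.columns if c != id_col]
    differences = [
        set(row_second) - set(row_first)
        for row_first, row_second in zip(
            first[value_cols].to_numpy(), second[value_cols].to_numpy()
        )
    ]
    # Index the result by customer ID so lookups by ID work directly.
    return pd.Series(differences, index=second[id_col], name="not_in_first")


# First four customers from the example data.
df_1 = pd.DataFrame({"KundelID": [1, 2, 3, 4],
                     "V1": list("AABB"), "V2": list("BABC"),
                     "V3": list("BAAA"), "V4": list("AAAA")})
df_2 = pd.DataFrame({"KundelID": [1, 2, 3, 4],
                     "V1": list("AACC"), "V2": list("AABB"),
                     "V3": list("CCBB"), "V4": list("BBAA")})

result = values_only_in_second(df_1, df_2, id_col="KundelID")
print(result)
```

Returning sets rather than joined strings keeps the result easy to post-process; joining with ','.join(sorted(s)) afterwards gives a stable string form.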
The idea is to transform all values in each row into a set. Then, we can take the set difference for each customer ID. This avoids explicit loops and list comprehensions:
df3 = (
    pd
    .concat([
        df1.reindex(index=df2.index).apply(set, axis=1),
        df2.apply(set, axis=1),
    ], axis=1)
    .apply(lambda r: r[1].difference(r[0]), axis=1)
)
print(df3)
# Out:
cust
1 {C}
2 {B, C}
3 {C}
4 {}
Notes:
The df1.reindex(index=df2.index) call is there in case some IDs are absent from df1 (or df2).
The result does not have to be a set. For example, using ','.join(r[1].difference(r[0])) as the lambda will make strings.
Setup:
For future reference, to make an example reproducible, it is a good idea to provide code that SO users can copy and paste directly to get started on your problem.
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""
1 A B B A
2 A A A A
3 B B A A
4 B C A A
"""), sep=' ', names='cust v1 v2 v3 v4'.split()).set_index('cust')
df2 = pd.read_csv(io.StringIO("""
1 A A C B
2 A A C B
3 C B B A
4 C B B A
"""), sep=' ', names='cust v1 v2 v3 v4'.split()).set_index('cust')
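To illustrate the two notes above, here is a small sketch under stated assumptions: the data is made up (customer 5 exists only in df2), and the column labels in_df1 and in_df2 are hypothetical names chosen for readability instead of the positional 0 and 1. It shows what the reindex step does for a missing ID and how a joined string can replace the set.

```python
import pandas as pd

# Made-up data: customer 5 is present in df2 but absent from df1.
df1 = pd.DataFrame(
    {"v1": ["A", "A"], "v2": ["B", "A"]},
    index=pd.Index([1, 2], name="cust"),
)
df2 = pd.DataFrame(
    {"v1": ["A", "C", "X"], "v2": ["C", "B", "Y"]},
    index=pd.Index([1, 2, 5], name="cust"),
)

# One column of row sets per dataframe, aligned on df2's index.
# The reindex adds an all-NaN row for customer 5, so nothing in
# df2's row for that customer is removed by the difference.
row_sets = pd.concat(
    [
        df1.reindex(index=df2.index).apply(set, axis=1),
        df2.apply(set, axis=1),
    ],
    axis=1,
    keys=["in_df1", "in_df2"],
)

as_sets = row_sets.apply(lambda r: r["in_df2"].difference(r["in_df1"]), axis=1)

# Same difference, rendered as a sorted, comma-separated string.
as_strings = row_sets.apply(
    lambda r: ",".join(sorted(r["in_df2"].difference(r["in_df1"]))), axis=1
)

print(as_sets)
print(as_strings)
```

Sorting before joining is an addition here; sets are unordered, so without it the string order could vary between runs.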
You can transform each dataframe into a Series of sets, then perform the set operation across the two Series, relying on the index alignment that pandas applies to Series operations:
df2.apply(set, axis=1) - df1.apply(set, axis=1)
Output:
cust
1 {C}
2 {C, B}
3 {C}
4 {}
dtype: object
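The alignment mentioned above means the rows do not need to be in the same order in both dataframes. A minimal sketch with made-up data, assuming both frames contain the same customer IDs (the variable name only_in_df2 is illustrative):

```python
import pandas as pd

# Same customer IDs in both frames, but listed in a different order.
df1 = pd.DataFrame(
    {"v1": ["B", "A"], "v2": ["C", "A"]},
    index=pd.Index([4, 2], name="cust"),
)
df2 = pd.DataFrame(
    {"v1": ["A", "C"], "v2": ["B", "B"]},
    index=pd.Index([2, 4], name="cust"),
)

# Subtraction pairs the sets by index label (cust), not by position.
only_in_df2 = df2.apply(set, axis=1) - df1.apply(set, axis=1)
print(only_in_df2)
```

Here customer 2 has {A, B} in df2 and {A} in df1, so the difference is {B}; customer 4 has {B, C} in both, so the difference is empty.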
If you want the symmetric difference across datasets (i.e. elements that are in one set or the other, but not both), then it is better to use pd.concat:
dfs = [df1, df2]
pd.concat([df.apply(set, axis=1) for df in dfs], axis=1).apply(lambda x: x[0] ^ x[1], axis=1)
Here axis=1 is spelled out because recent pandas versions no longer accept it as a positional argument to pd.concat. Replacing x[0] ^ x[1] with set.symmetric_difference(*x) should work as well.
Interestingly, Series_A ^ Series_B doesn't work as expected. Instead, it apparently returns a bool Series indicating whether the result of each set operation is non-empty.
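As a minimal sketch of the symmetric-difference variant, the example below uses made-up data in which each dataframe holds a value the other lacks. The names row_sets and sym_diff are illustrative, and .iloc is used here for positional access to the two columns produced by pd.concat.

```python
import pandas as pd

# Made-up data: for customer 1, "D" appears only in df1 and "C" only in df2.
df1 = pd.DataFrame(
    {"v1": ["A", "A"], "v2": ["D", "A"]},
    index=pd.Index([1, 2], name="cust"),
)
df2 = pd.DataFrame(
    {"v1": ["A", "A"], "v2": ["C", "A"]},
    index=pd.Index([1, 2], name="cust"),
)

# One column of row sets per dataframe.
row_sets = pd.concat([df.apply(set, axis=1) for df in (df1, df2)], axis=1)

# Apply ^ row by row, on the Python sets themselves rather than on the Series.
sym_diff = row_sets.apply(lambda r: r.iloc[0] ^ r.iloc[1], axis=1)
print(sym_diff)
```

Applying the operator inside the lambda sidesteps the Series-level ^ behaviour described above, because the operation runs on two plain Python sets at a time.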