
Find duplicates between 2 columns (independent of order), count and drop them in Python

I'm trying to find the duplicates between 2 columns, where order is independent, but I need to keep the count of duplicates after dropping them.

import pandas as pd

df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'])

This is my expected result

    source  target   count
0     A       B        2
1     D       B        1
3     B       C        2

I've already tried several approaches, but I can't come close to a solution.

It does not matter which combination is kept. In the example result I kept the first occurrence.

Thanks in advance

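You can use df.duplicated() to see which rows are duplicated: it returns True for a row that repeats an earlier one and False otherwise. Note that it compares the columns position by position, so ('A', 'B') and ('B', 'A') are not treated as duplicates unless the pair is normalized first. For more information and a practical example, check the documentation.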
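As a minimal sketch of how df.duplicated() could be applied to this question, assuming the two values in each row can be sorted against each other (true for the strings here): sort each pair within its row first, then ask for duplicates. The names pairs, is_dup, 'a' and 'b' are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'])

# Sort the two values inside each row so ('B', 'A') becomes ('A', 'B').
# Assumption: the values are mutually comparable (all strings here).
pairs = pd.DataFrame(np.sort(df[['source', 'target']].to_numpy(), axis=1),
                     index=df.index, columns=['a', 'b'])

# True for every row whose unordered pair already appeared in an earlier row
is_dup = pairs.duplicated(keep='first')

print(df[~is_dup])
```

This only drops the repeated rows; the count still has to be computed separately.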

Create a summary by applying frozenset to the desired columns. Here we're using all columns.

summary = df.apply(frozenset, axis=1).value_counts()

This'll give you a Series of:

(A, B)    2
(C, B)    2
(B, D)    1
dtype: int64

You can then reconstruct a DataFrame by iterating over that series, eg:

df2 = pd.DataFrame(((*idx, val) for idx, val in summary.items()), columns=[*df.columns, 'count'])

Which results in:

  source target  count
0      A      B      2
1      C      B      2
2      B      D      1
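Two caveats with the frozenset version: the order in which a frozenset yields its elements is arbitrary (which is why the second row above reads C, B), and a row where source equals target collapses to a single element, so the unpacking into three columns would fail. A possible variation, assuming the values are sortable, is to use a sorted tuple as the key instead. This is a sketch, not the answer's original code; keys is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'])

# Build an order-independent key per row: a sorted tuple always has two
# elements, even when source == target, and its element order is fixed.
keys = pd.Series([tuple(sorted(p)) for p in zip(df['source'], df['target'])],
                 index=df.index)

# Count how often each unordered pair occurs
summary = keys.value_counts()

# Rebuild a DataFrame from the (pair, count) items
df2 = pd.DataFrame([(*idx, val) for idx, val in summary.items()],
                   columns=[*df.columns, 'count'])
print(df2)
```

The source/target values in df2 come out in sorted order, not necessarily as they appeared in the first matching row.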

The following approach creates a temporary column containing a frozenset of the values in the specified columns. The advantage is that all other columns are preserved in the final result. Furthermore, the index is preserved the same way as in the expected output you posted:

df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
              columns=['source', 'target'],)

# Create column with set of both columns
df['tmp'] = df.apply(lambda x: frozenset([x['source'], x['target']]), axis=1)

# Create count column based on new tmp column
df['count'] = df.groupby(['tmp'])['target'].transform('size')

# Drop duplicate rows based on new tmp column
df = df[~df.duplicated(subset='tmp', keep='first')]

# Remove tmp column
df = df.drop(columns='tmp')

df

Output:

    source  target  count
0   A   B   2
1   D   B   1
3   B   C   2
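If you want to reuse this without modifying the original frame, the same steps could be wrapped in a small function. This is only a sketch: count_unordered_pairs is a hypothetical helper name, and the extra weight column is made-up data added to show that other columns survive.

```python
import pandas as pd

def count_unordered_pairs(df, a='source', b='target'):
    """Hypothetical helper: keep the first row of each unordered (a, b) pair
    and add a 'count' column with the number of rows sharing that pair."""
    # Order-independent key, kept outside the frame so no tmp column is needed
    key = pd.Series([frozenset(p) for p in zip(df[a], df[b])], index=df.index)
    out = df.copy()
    # Size of each key group, broadcast back to every row
    out['count'] = key.groupby(key, sort=False).transform('size')
    # Keep only the first occurrence of each key; original index is preserved
    return out[~key.duplicated(keep='first')]

df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'])
df['weight'] = [10, 20, 30, 40, 50]  # illustrative extra column

result = count_unordered_pairs(df)
print(result)
```

The input df is left unchanged, and result keeps the rows with index 0, 1 and 3 along with their weight values.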
