
Eliminate duplicates in dictionary Python

I have a csv file separated by tabs:

[image: enter image description here]

I only need to look at the first two columns and find, for example, whether the pair AB appears again in the document as BA; if it does, print AB. The same goes for the rest of the pairs.

For the example shown, the output is: AB and CD

    import pandas as pd

    colnames = ['col1', 'col2', 'col3', 'col4', 'col5']

    data = pd.read_csv('koko.csv', names=colnames, delimiter='\t')

    col1 = data.col1.tolist()
    col2 = data.col2.tolist()

    dic = {}
    dataset = list(zip(col1, col2))
    for a, b in dataset:
        # Originally `(a, b) and (b, a) in dataset`; a non-empty tuple is
        # always truthy, so only `(b, a) in dataset` was ever tested.
        if (b, a) in dataset:
            dic[a] = b
    print(dic)

output = {'A': 'B', 'B': 'A', 'D': 'C', 'C':'D'}

How can I avoid duplicated (or swapped) results in the dictionary?

Does this work?

import pandas as pd
import numpy as np

col_1 = ['A', 'B', 'C', 'B', 'D']
col_2 = ['B', 'C', 'D', 'A', 'C']

df = pd.DataFrame(np.column_stack([col_1,col_2]), columns = ['Col1', 'Col2'])

df['combined'] = list(zip(df['Col1'], df['Col2']))

final_set = set(tuple(sorted(t)) for t in df['combined'])

final_set looks like this:

 {('C', 'D'), ('A', 'B'), ('B', 'C')}

The output contains more than AB and CD because the second row, BC, is kept even though CB never appears.
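A minimal sketch of one way to keep only the pairs that also occur reversed, in plain Python and using the same sample rows as above. The helper name `reversed_pairs` is hypothetical, and skipping pairs whose two elements are equal (such as AA) is an assumption, not something the question specifies.

```python
def reversed_pairs(pairs):
    """Return each pair once, but only if it also occurs in swapped order."""
    seen = set(pairs)      # set membership tests avoid rescanning the list
    reported = set()       # unordered pairs already added to the result
    result = []
    for a, b in pairs:
        key = frozenset((a, b))   # (A, B) and (B, A) map to the same key
        # Assumption: a pair like (A, A) is not counted as its own reverse.
        if a != b and (b, a) in seen and key not in reported:
            reported.add(key)
            result.append((a, b))
    return result


col_1 = ['A', 'B', 'C', 'B', 'D']
col_2 = ['B', 'C', 'D', 'A', 'C']
print(reversed_pairs(list(zip(col_1, col_2))))
# [('A', 'B'), ('C', 'D')]
```

The pair is reported in the order in which it is first encountered, so BC is dropped here because CB never occurs.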

The following should work.

Example df used:

df = pd.DataFrame({'Col1' : ['A','C','D','B','D','A'], 'Col2' : ['B','D','C','A','C','B']})

This is the function I used:

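# result_type='broadcast' keeps the DataFrame shape; without it, recent
# pandas versions return a Series of lists and the column lookup below fails.
temp = df[['Col1','Col2']].apply(lambda row: sorted(row), axis=1, result_type='broadcast')
print(temp[['Col1','Col2']].drop_duplicates())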

useful links:

checking if a string is in alphabetical order in python

Difference between map, applymap and apply methods in Pandas

Here is one way.

df = pd.DataFrame({'Col1' : ['A','C','D','B','D','A','E'],
                   'Col2' : ['B','D','C','A','C','B','F']})

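import numpy as np

# Steps 1-2: drop exact duplicate rows, then sort the values within each row.
pairs = df[['Col1', 'Col2']].drop_duplicates()
pairs = pd.DataFrame(np.sort(pairs.to_numpy(), axis=1), columns=['Col1', 'Col2'])
# Steps 3-4: keep rows that now occur more than once, then collapse them.
df = pairs[pairs.duplicated(keep=False)].drop_duplicates()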

#   Col1 Col2
# 0    A    B
# 1    C    D

Explanation

The steps are:

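  1. Remove exact duplicate rows.
  2. Sort the values within each row, so that (B, A) becomes (A, B).
  3. Keep only the rows that now occur more than once, discarding the unique ones.
  4. Remove duplicate rows again, so each pair appears once.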
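To see what each step contributes, the four steps can be run one at a time on the example frame. This is only a sketch: the names `step1` to `step4` are made up for illustration, and the row-wise sort uses `np.sort` on the underlying array, which is an assumption chosen because it returns a plain 2-D array regardless of pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'C', 'D', 'B', 'D', 'A', 'E'],
                   'Col2': ['B', 'D', 'C', 'A', 'C', 'B', 'F']})

# 1. Remove exact duplicate rows: 7 rows -> 5 rows.
step1 = df.drop_duplicates()

# 2. Sort the two values within each row, so (D, C) becomes (C, D).
step2 = pd.DataFrame(np.sort(step1.to_numpy(), axis=1), columns=step1.columns)

# 3. Keep only rows that now occur more than once; (E, F) is dropped.
step3 = step2[step2.duplicated(keep=False)]

# 4. Collapse the remaining rows so each pair appears once.
step4 = step3.drop_duplicates()

result = list(step4.itertuples(index=False, name=None))
print(result)
# [('A', 'B'), ('C', 'D')]
```

Step 1 matters because a pair that appears twice in the same order, and never reversed, would otherwise survive step 3.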
