
Quickly eliminate "circular duplicates" in a big list (python)

I have this (python) list

my_list = [['dog','cat','mat','fun'],['bob','cat','pan','fun'],['dog','ben','mat','rat'],
['cat','mat','fun','dog'],['mat','fun','dog','cat'],['fun','dog','cat','mat'],
['rat','dog','ben','mat'],['dog','mat','cat','fun'], ...
]

my_list has 200704 elements

Note here
my_list[0] = ['dog','cat','mat','fun']
dog->cat->mat->fun->dog
my_list[3] = ['cat','mat','fun','dog']
cat->mat->fun->dog->cat
my_list[4] = ['mat','fun','dog','cat']
mat->fun->dog->cat->mat
my_list[5] = ['fun','dog','cat','mat']
fun->dog->cat->mat->fun
Going circular, they are all the same. So they should be marked duplicates.

Note:
my_list[0] = ['dog','cat','mat','fun']
my_list[7] = ['dog','mat','cat','fun']
These should NOT be marked duplicates since going circular, they are different.

Similarly,
my_list[2] = ['dog','ben','mat','rat']
my_list[6] = ['rat','dog','ben','mat']
They should be marked duplicates.
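
The rule above can be stated as a small pairwise check. This is only a sketch: the helper name is_circular_duplicate is hypothetical and not part of the original code. Comparing every pair this way would still be quadratic, so it illustrates the definition rather than a fast solution.

```python
def is_circular_duplicate(a, b):
    """Return True if list b is a rotation of list a (hypothetical helper)."""
    if len(a) != len(b):
        return False
    # Every rotation of a has the form a[k:] + a[:k] for some offset k.
    # max(1, ...) lets two empty lists compare equal.
    return any(a[k:] + a[:k] == b for k in range(max(1, len(a))))


# my_list[0] vs my_list[3]: same cycle, different starting word
print(is_circular_duplicate(['dog', 'cat', 'mat', 'fun'], ['cat', 'mat', 'fun', 'dog']))
# my_list[0] vs my_list[7]: same words, different cycle
print(is_circular_duplicate(['dog', 'cat', 'mat', 'fun'], ['dog', 'mat', 'cat', 'fun']))
```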

def remove_circular_duplicates(my_list):
    # the quicker and more elegant logic goes here

    # the function should identify that my_list[0], my_list[3], my_list[4] and my_list[5] are circular duplicates,
    # keep only my_list[0] and delete the other 3
    # same for my_list[2] and my_list[6], and so on

    return my_list_with_no_circular_duplicates

----------------------------------------------------------------
My try:
----------------------------------------------------------------
This works, but it takes more than 3 hours to finish 200704 elements.
And it's not an elegant way either (pardon my level).

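import datetime

start_time=datetime.datetime.now()

t=my_list
tLen=len(t)
# finalT holds each quad plus a fifth status field: 'unmarked', 'original' or 'duplicate'
finalT=[quad+['unmarked'] for quad in t]
i=0   # index of the quad being checked
j=0   # index of the quad it is compared against
c=0   # progress counter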
while i<tLen:
    c=c+1
    if c>2000:
        # this is just to keep you informed of the progress
        print(f'{i} of {tLen} finished ..')
        c=0
    if (finalT[i][4]=='unmarked'):
        # make 0-1-2-3 -> 1-2-3-0 and check any duplicates
        x0,x1,x2,x3 = t[i][1],t[i][2],t[i][3],t[i][0]
        # make 0-1-2-3 -> 2-3-0-1 and check any duplicates
        y0,y1,y2,y3 = t[i][2],t[i][3],t[i][0],t[i][1]
        # make 0-1-2-3 -> 3-0-1-2 and check any duplicates
        z0,z1,z2,z3 = t[i][3],t[i][0],t[i][1],t[i][2]
        while j<tLen:
            if (finalT[j][4]=='unmarked' and j!=i):
                #j!=i skips checking the same (self) element
                tString=t[j][0]+t[j][1]+t[j][2]+t[j][3]
                if (x0+x1+x2+x3 == tString) or (y0+y1+y2+y3 == tString) or (z0+z1+z2+z3 == tString):
                    # duplicate found, mark it as 'duplicate'
                    finalT[j][4]='duplicate'
                tString=''
            j=j+1
        finalT[i][4] = 'original'
        j=0
    i=i+1
# make list of only those marked as 'original'
i=0
ultimateT = []
while i<tLen:
    if finalT[i][4] == 'original':
        ultimateT.append(finalT[i])
    i=i+1
# strip the 'original' mark and keep only the quad
i=0
ultimateTLen=len(ultimateT)
while i<ultimateTLen:
    ultimateT[i].remove('original')
    i=i+1
my_list_with_no_circular_duplicates = ultimateT

print(f'\n\nDONE!!  \nStarted at: {start_time}\nEnded at {datetime.datetime.now()}')
# when this code is the body of remove_circular_duplicates(), finish with:
# return my_list_with_no_circular_duplicates

What I want is a quicker way of doing the same.
Thanks in advance.

Your implementation is an O(n²) algorithm: every entry is compared against every other entry, so the running time grows quadratically with the size of the data set. With about 200,000 entries that is on the order of 40 billion comparisons. You need to convert this to an O(n) or O(n log n) algorithm. To do that, preprocess the data so that you can tell whether a circularly equivalent item has already been seen without searching through the list. Put each entry into a common form so that two circularly equivalent entries become identical. I would recommend rotating each entry so that its alphabetically first item comes first. For example, change ['dog','cat','mat','fun'] to ['cat','mat','fun','dog']. Each entry is processed once, so this step is O(n).
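
A minimal sketch of that rotation step follows. The function name canonical_rotation is hypothetical. It takes the smallest of all rotations, which gives the same result as "alphabetically first item first" when the words in an entry are distinct, and also stays well defined if a word repeats inside an entry (the question does not say whether that can happen).

```python
def canonical_rotation(quad):
    """Return the lexicographically smallest rotation of quad as a tuple."""
    n = len(quad)
    if n == 0:
        return ()
    # Build every rotation quad[k:] + quad[:k] and keep the smallest one.
    return min(tuple(quad[k:]) + tuple(quad[:k]) for k in range(n))


print(canonical_rotation(['dog', 'cat', 'mat', 'fun']))  # ('cat', 'mat', 'fun', 'dog')
print(canonical_rotation(['fun', 'dog', 'cat', 'mat']))  # ('cat', 'mat', 'fun', 'dog')
print(canonical_rotation(['dog', 'mat', 'cat', 'fun']))  # ('cat', 'fun', 'dog', 'mat')
```

Returning a tuple rather than a list keeps the result hashable, so it can be used directly as a set member or dictionary key.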

With all entries in a common format, you have several ways to determine whether each entry is unique. I would use a set. For each item, check whether it is already in the set. If not, it is unique so far and should be added to the set. If it is already in the set, an equivalent item has already been found and this item can be removed. Checking whether an item is in a set takes constant time on average in Python, because a set uses a hash table to look the item up directly instead of searching. (Set members must be hashable, so store each entry as a tuple rather than a list.) Going through every entry and doing this check is therefore also an O(n) operation. Overall the algorithm is O(n) and will be dramatically faster than what you were doing.
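
The set check described above might look like the sketch below. It assumes the entries have already been rotated into their common form. The function name unique_in_order and the sample data are illustrative only.

```python
def unique_in_order(entries):
    """Keep the first occurrence of each entry, preserving input order."""
    seen = set()
    result = []
    for entry in entries:
        key = tuple(entry)  # lists are not hashable, tuples are
        if key not in seen:  # average constant-time hash lookup
            seen.add(key)
            result.append(entry)
    return result


# Entries already rotated so the alphabetically first word comes first.
normalized = [
    ['cat', 'mat', 'fun', 'dog'],
    ['bob', 'cat', 'pan', 'fun'],
    ['ben', 'mat', 'rat', 'dog'],
    ['cat', 'mat', 'fun', 'dog'],  # repeat of the first entry
]
print(unique_in_order(normalized))
```

The fourth entry is dropped because its tuple is already in the set. Each entry is visited once, so the pass is linear in the length of the list.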

@BradBudlong
Brad Budlong's answer is right. Here are the results of implementing it.

My method (given in the question):
Time taken: ~274 min
Result: len(my_list_without_circular_duplicates) >> 50176

Brad Budlong's method:
Time taken: ~12 sec (great !)
Result: len(my_list_without_circular_duplicates) >> 50176

The following is my implementation of Brad Budlong's method:

# extract all individual words like 'cat', 'rat', 'fun' and put them in a set (no duplicates)
all_non_duplicate_words_from_my_list = {word for quad in my_list for word in quad}
# and sort them alphabetically
alphabetically_sorted_words = sorted(all_non_duplicate_words_from_my_list)

# mark all as 'unsorted'
all_q_marked=[]
for i in my_list:
    all_q_marked.append([i,'unsorted'])

# format my_list- in Brad's words,
# rotate each entry so that it has the alphabetically first item first. 
# For example change ['dog','cat','mat','fun'] to ['cat','mat','fun','dog'] 
for w in alphabetically_sorted_words:
    print(f'{w} in progress ..')
    for q in all_q_marked:
        if q[1]=='unsorted':
            # check if the word exist in the quad
            if w in q[0]:
                # word exist, then rotate this quad to put that word in first place
                # rotation_count=q[0].index(w) -- alternate method lines
                quad=q[0]
                for j in range(4):
                    quad=quad[-1:] + quad[:-1]
                    if quad[0]==w:
                        q[0]=quad
                        break
                # mark as sorted
                q[1]='sorted'

# strip the 'sorted' mark and keep only the quad
formatted_my_list=[q[0] for q in all_q_marked]

# finally remove duplicate lists in the list
my_list_without_circular_duplicates = [list(t) for t in set(tuple(element) for element in formatted_my_list)]
print (my_list_without_circular_duplicates)

Note that although this loops over alphabetically_sorted_words (201 words) and, for each word, over the entire all_q_marked list (200704 entries), the work per pass shrinks quickly: a quad already marked 'sorted' is skipped after a single flag check, so the membership test and rotation run only on the quads that are still 'unsorted'.
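
For comparison, the rotation and the duplicate check can also be done in a single pass over my_list, without the per-word loop. The sketch below is not the code that produced the timings above. It assumes the words within each quad are distinct, so that min(quad) identifies a unique starting point. Unlike the set-based line above, it returns the first-seen entry of each group in its original word order, which is what the question asked for (keep only my_list[0]).

```python
def remove_circular_duplicates(my_list):
    """Drop entries that are rotations of an earlier entry (single pass)."""
    seen = set()
    kept = []
    for quad in my_list:
        # Assumption: no word repeats inside a quad, so the position of
        # the alphabetically first word is unambiguous.
        k = quad.index(min(quad))
        key = tuple(quad[k:] + quad[:k])  # rotated form, hashable
        if key not in seen:
            seen.add(key)
            kept.append(quad)  # keep the entry as originally written
    return kept


sample = [
    ['dog', 'cat', 'mat', 'fun'], ['bob', 'cat', 'pan', 'fun'],
    ['dog', 'ben', 'mat', 'rat'], ['cat', 'mat', 'fun', 'dog'],
    ['mat', 'fun', 'dog', 'cat'], ['fun', 'dog', 'cat', 'mat'],
    ['rat', 'dog', 'ben', 'mat'], ['dog', 'mat', 'cat', 'fun'],
]
print(remove_circular_duplicates(sample))
```

On the eight sample entries from the question this keeps entries 0, 1, 2 and 7.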
