
Finding duplicates in multiple HUGE lists in Python (compare 2, 3, 4, 5 lists)

So I'm currently working on 5 dictionaries, and very possibly more in the future, with at least 257,000 entries each. Think of them as 5 huge text files (10-20 MB each) with roughly 10-30 characters per line. An example entry looks like:

abaissements volontaires,abaissement volontaire.N+NA:mp

My mission is to find duplicate words between/among the different dictionaries. So first of all, I have to process each file to extract, for example, only abaissements volontaires from the entry above. After that, my idea is to build a list that contains elements like:

dict_word_list = [[dict_A, [word1, word2, ...]], [dict_B, [word1, word2, ...]]]

The choice of lists over dicts is simply because dicts are unordered in Python and I have to know the name of the corresponding dictionary of each word list, so I put the corresponding dictionary names in element 0 of each list.

My question is: how do I find duplicates between/among these huge lists while keeping the dictionary names? I tried if not in list, but because of the file size and a very old processor (an Intel Core i3 in a shabby old laptop at work; I cannot use my own laptop due to confidentiality issues), the program simply gets stuck there.

Maybe sets would be a solution, but how do I run the comparison across the different combinations of dictionaries? I would like to have results like:

Duplicates dict_A, dict_B: [word1, word2, word3, ...]

Duplicates dict_B, dict_C: [word1, word2, word3, ...]

Duplicates dict_A, dict_B, dict_C: [word1, word2, word3, ...]

Sets are a very good approach. You could do something like this:

>>> dict_1 = {1, 2, 3}
>>> dict_2 = {2, 3, 4}
>>> dict_3 = {3, 4, 5}
>>> dict_1 & dict_2
{2, 3}
>>> dict_1 & dict_2 & dict_3
{3}

From the docs:

s & t - new set with elements common to s and t
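To keep the dictionary names attached and cover every pairing, one option is to store the sets in a mapping from name to set and walk the combinations with itertools.combinations. The following is only a minimal sketch under some assumptions: the helper names (headword, load_words, find_duplicates) are made up for illustration, each file is assumed to hold one entry per line with the word before the first comma, and the file encoding is assumed to be UTF-8. The toy sets stand in for the real files.

```python
from itertools import combinations


def headword(line):
    """Return the text before the first comma of an entry line."""
    # Assumption: the word itself never contains a comma.
    return line.split(",", 1)[0].strip()


def load_words(path, encoding="utf-8"):
    """Read one dictionary file into a set of headwords (hypothetical helper)."""
    # Assumption: one entry per line; blank lines are skipped.
    with open(path, encoding=encoding) as handle:
        return {headword(line) for line in handle if line.strip()}


def find_duplicates(word_sets, min_size=2):
    """Map each combination of dictionary names to the words they all share."""
    names = sorted(word_sets)
    results = {}
    for size in range(min_size, len(names) + 1):
        for combo in combinations(names, size):
            common = set.intersection(*(word_sets[name] for name in combo))
            if common:  # skip combinations with nothing in common
                results[combo] = sorted(common)
    return results


# Toy data standing in for the real files; with real data this could be
# something like {name: load_words(path) for name, path in files.items()}.
word_sets = {
    "dict_A": {"word1", "word2", "word3"},
    "dict_B": {"word2", "word3", "word4"},
    "dict_C": {"word3", "word4", "word5"},
}

for combo, words in find_duplicates(word_sets).items():
    print("Duplicates " + ", ".join(combo) + ":", words)
```

With 5 dictionaries this checks 26 combinations of two or more names. Each intersection is a hash-based pass over the sets rather than a repeated list scan, which is what makes the if not in list approach so slow on lists of this size.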
