I'm currently working on 5 dictionaries, and very possibly more in the future, each with at least 257,000 entries. Consider them as 5 huge text files (10-20 MB each) with, say, 10-30 characters per line. An example entry looks like:
abaissements volontaires,abaissement volontaire.N+NA:mp
My mission is to find the duplicate words between/among the different dictionaries. So first of all, I have to process each file to extract, for example, only abaissements volontaires from the entry above. After that, my idea is to build a list containing elements like:
dict_word_list = [[dict_A, [word1, word2, ...]], [dict_B, [word1, word2, ...]]]
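The extraction step can be sketched like this, assuming every line follows the "inflected form,lemma.CODE" format of the example above (the file names are placeholders):

```python
# Minimal parsing sketch: the word to keep is everything
# before the first comma on each non-empty line.
def load_words(path):
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                words.append(line.split(",", 1)[0])
    return words

# Hypothetical usage, one entry per dictionary file:
# dict_word_list = [["dict_A", load_words("dict_A.txt")],
#                   ["dict_B", load_words("dict_B.txt")]]
```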
The choice of lists over dicts is simply because dicts used to be unordered in Python (insertion order is only guaranteed since 3.7), and I have to know which dictionary each word list belongs to, so I put the dictionary name in element 0 of each list.
My question is: how do I find the duplicates between/among these huge lists while keeping the dictionary names? I tried if word not in list, but due to the file sizes and a very old processor (an Intel Core i3 in a shabby old laptop at work; I can't use my own laptop for confidentiality reasons), the program simply hangs there.
Maybe sets would be a solution, but how do I organize all the pairwise (and three-way, etc.) comparisons? I would like to get results like:
Duplicates dict_A, dict_B: [word1, word2, word3, ...]
Duplicates dict_B, dict_C: [word1, word2, word3, ...]
Duplicates dict_A, dict_B, dict_C: [word1, word2, word3, ...]
Sets are a very good approach. You could do something like this:
>>> dict_1 = {1, 2, 3}
>>> dict_2 = {2, 3, 4}
>>> dict_3 = {3, 4, 5}
>>> dict_1 & dict_2
{2, 3}
>>> dict_1 & dict_2 & dict_3
{3}
From the docs:
s & t - new set with elements common to s and t
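To get the exact "Duplicates dict_A, dict_B: [...]" output for every group of dictionaries, you can intersect each combination of named sets; here is a sketch with toy placeholder data (the names and words are illustrative, not from the original files):

```python
from itertools import combinations

# Toy data standing in for the parsed dictionaries.
named_sets = {
    "dict_A": {"word1", "word2", "word3"},
    "dict_B": {"word2", "word3", "word4"},
    "dict_C": {"word3", "word4", "word5"},
}

# For every combination of 2, 3, ... dictionaries, intersect
# their word sets; set.intersection(*sets) generalizes `&`.
for r in range(2, len(named_sets) + 1):
    for names in combinations(sorted(named_sets), r):
        common = set.intersection(*(named_sets[n] for n in names))
        if common:
            print("Duplicates", ", ".join(names) + ":", sorted(common))
```

Since set lookups and intersections run in roughly linear time over the set sizes, this stays practical even for hundreds of thousands of entries per dictionary.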