Reducing duplicates in Python list of lists

Question

I am writing a program that reads in a number of files and then indexes the terms in them. I can read the files into a 2D array (a list of lists) in Python, but then I need to remove the duplicates in the first column and append each duplicate's DocID as a new column on the row where the word first appears.

For example:

['when', 1]
['yes', 1]
['', 1]
['greg', 1]
['17', 1]
['when',2]

The first column is the term, and the second is the DocID that it came from. I want to be able to change this to:

['when', 1, 2]
['yes', 1]
['', 1]
['greg', 1]
['17', 1]

removing the duplicate.

This is what I have so far:

for j in range(0,len(index)):
        for r in range(1,len(index)):
                if index[j][0] == index[r][0]:
                        index[j].append(index[r][1])
                        index.remove(index[r])

I keep getting an out of range error at

if index[j][0] == index[r][0]:

I think it is because I'm removing an object from the index, so it is becoming smaller. Any ideas would be much appreciated. (And yes, I know I shouldn't modify the original, but this is just testing it on a small scale.)

Answer 1

Wouldn't it be more appropriate to build a dict / defaultdict?

Something like:

from collections import defaultdict

ar = [['when', 1],
      ['yes', 1],
      ['', 1],
      ['greg', 1],
      ['17', 1],
      ['when',2]] 

result = defaultdict(list)
for lst in ar:
    result[lst[0]].append(lst[1])

Output:

>>> for k,v in result.items():
...     print(repr(k),v)
'' [1]
'yes' [1]
'greg' [1]
'when' [1, 2]
'17' [1]
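
If the same term occurs more than once in one document, the loop above would record that DocID repeatedly. A minimal sketch, assuming each DocID should be listed only once per term; the function name `build_index` and the sample rows are made up for illustration:

```python
from collections import defaultdict

def build_index(rows):
    """Group DocIDs by term, listing each DocID once per term."""
    result = defaultdict(list)
    for term, doc_id in rows:
        # Skip DocIDs already recorded for this term (assumed requirement).
        if doc_id not in result[term]:
            result[term].append(doc_id)
    return dict(result)

# Hypothetical input: 'when' appears twice in document 1 and once in document 2.
rows = [['when', 1], ['when', 1], ['yes', 1], ['when', 2]]
index_by_term = build_index(rows)
```

The `doc_id not in result[term]` check is a linear scan, so for terms that appear in many documents a set per term would probably be the faster choice.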

Answer 2

Yes, your error comes from modifying the list in place. Besides, your solution would be inefficient for long lists, since it compares every row with every other row. It's better to use a dictionary instead, and convert it back to a list at the end:

from collections import defaultdict
od = defaultdict(list)

for term, doc_id in index:
    od[term].append(doc_id)

result = [[term] + doc_ids for term, doc_ids in od.items()]

print(result)
# [['when', 1, 2], ['yes', 1], ['', 1], ['greg', 1], ['17', 1]]
# (first-appearance order on Python 3.7+, where dicts keep insertion order)
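
The out of range error can be reproduced in isolation. The sketch below wraps the question's loop in a function (the name `naive_merge` is hypothetical) to show that `range(len(rows))` is evaluated once, so the indices keep running up to the original length after rows have been removed. Note also that the inner loop compares a row with itself whenever `j == r`.

```python
def naive_merge(rows):
    # Same shape as the loop in the question: both range() bounds are
    # computed from the original length and never shrink afterwards.
    for j in range(0, len(rows)):
        for r in range(1, len(rows)):
            if rows[j][0] == rows[r][0]:
                rows[j].append(rows[r][1])
                rows.remove(rows[r])  # the list gets shorter here

index = [['when', 1],
         ['yes', 1],
         ['', 1],
         ['greg', 1],
         ['17', 1],
         ['when', 2]]

try:
    naive_merge(index)
    failed = False
except IndexError:
    # Reached once j or r points past the end of the shortened list.
    failed = True
```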

Answer 3

Actually, you could have done this using range() and len(). However, the beauty of Python is that you can iterate directly over the elements of a list without indexes.

Take a look at this code and try to understand it.

#!/usr/bin/env python

def main():

    tot_array = \
    [ ['when', 1],
      ['yes', 1],
      ['', 1],
      ['greg', 1],
      ['17', 1],
      ['when',2]
    ]

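    # Caution: this removes items from tot_array while iterating over it,
    # which can skip elements on other inputs; it works for this data.
    for aList1 in tot_array:
        for aList2 in tot_array:
            if aList1[0] == aList2[0] and aList1 != aList2:
                aList1.append(aList2[1])
                tot_array.remove(aList2)
    print(tot_array)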

if __name__ == '__main__':
    main()

The output looks like:

[['when', 1, 2], ['yes', 1], ['', 1], ['greg', 1], ['17', 1]]
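
One caveat with the nested loop: because it removes rows from the list it is iterating over, it can skip or mis-merge rows when a term occurs three or more times. A minimal sketch of an alternative that builds a new list instead and keeps first-appearance order; the function name `merge_rows` is made up for illustration, and it relies on dicts preserving insertion order (Python 3.7+):

```python
def merge_rows(rows):
    """Merge [term, doc_id] rows that share a term, without mutating the input."""
    merged = {}  # term -> merged row, in order of first appearance
    for term, doc_id in rows:
        if term in merged:
            merged[term].append(doc_id)
        else:
            merged[term] = [term, doc_id]
    return list(merged.values())

tot_array = [['when', 1],
             ['yes', 1],
             ['', 1],
             ['greg', 1],
             ['17', 1],
             ['when', 2]]

print(merge_rows(tot_array))
# [['when', 1, 2], ['yes', 1], ['', 1], ['greg', 1], ['17', 1]]
```

Each row is visited once, so this also avoids the quadratic cost of comparing every pair of rows.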

3 answers

solution1
3 2012-02-28 16:20:05

solution2
1 2012-02-28 16:26:10

solution3
0 2012-02-28 16:56:50
