
Generate unique IDs for a list of strings with duplicates

I want to generate IDs for strings that are being read from a text file. If the strings are duplicates, I want the first instance of the string to have an ID containing 6 characters. For the duplicates of that string, I want the ID to be the same as the original one, but with an additional two characters. I'm having trouble with the logic. Here's what I've done so far:

from itertools import groupby
import uuid
f = open('test.txt', 'r')
addresses = f.readlines()

list_of_addresses = ['Address']
list_of_ids = ['ID']


for x in addresses:
    list_of_addresses.append(x)


def find_duplicates():

    for x, y in groupby(sorted(list_of_addresses)):
        id = str(uuid.uuid4().get_hex().upper()[0:6])
        j = len(list(y))
        if j > 1:
            print str(j) + " instances of " + x
            list_of_ids.append(id)
        print list_of_ids

find_duplicates()

How should I approach this?

Edit: here's the contents of test.txt:

123 Test
123 Test
123 Test
321 Test
567 Test
567 Test

And the output:

3 occurrences of 123 Test

['ID', 'C10DD8']
['ID', 'C10DD8']
2 occurrences of 567 Test

['ID', 'C10DD8', '595C5E']
['ID', 'C10DD8', '595C5E']

If the strings are duplicates, I want the first instance of the string to have an ID containing 6 characters. For the duplicates of that string, I want the ID to be the same as the original one, but with an additional two characters.

Try using a collections.defaultdict.
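
If defaultdict is unfamiliar, here is a minimal sketch of the behavior the code below relies on (the key 'C10DD8' is just an illustrative placeholder, not a value the code below produces):

import collections as ct

# defaultdict(list) supplies an empty list for any missing key on first
# access, so .append() works whether or not the key has been seen before.
dd = ct.defaultdict(list)
dd["C10DD8"].append("123 Test")
dd["C10DD8"].append("123 Test")
print(dd)   # defaultdict(<class 'list'>, {'C10DD8': ['123 Test', '123 Test']})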

Given

import ctypes
import collections as ct


filename = "test.txt"


def read_file(fname):
    """Read lines from a file."""
    with open(fname, "r") as f:
        for line in f:
            yield line.strip()

Code

dd = ct.defaultdict(list)
for x in read_file(filename):
    key = str(ctypes.c_size_t(hash(x)).value)      # make positive hashes
    if key[:6] not in dd:
        dd[key[:6]].append(x)
    else:
        dd[key[:8]].append(x)

dd

Output

defaultdict(list,
            {'133259': ['123 Test'],
             '13325942': ['123 Test', '123 Test'],
             '210763': ['567 Test'],
             '21076377': ['567 Test'],
             '240895': ['321 Test']})

The resulting dictionary has a 6-character key for the first occurrence of each unique line; every subsequent duplicate of that line is stored under an 8-character key (the same six characters plus two more).

You can implement the keys however you wish. In this case, we use hash() to correlate a key with each unique line and then slice the desired number of characters from it. See also a post on making positive hash values with ctypes.
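
If the ctypes step is unclear, here is a minimal, self-contained sketch of just that conversion. The string '123 Test' comes from the sample file; the printed numbers will differ between runs because Python 3 randomizes str hashes per process unless PYTHONHASHSEED is fixed.

import ctypes

# hash() can return a negative integer; reinterpreting it as an unsigned
# size_t via ctypes yields a non-negative value whose decimal digits can
# then be sliced into a short key.
signed = hash("123 Test")
unsigned = ctypes.c_size_t(signed).value
print(signed, unsigned, str(unsigned)[:6])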


To inspect your results, create the appropriate lookup dictionaries from the defaultdict.

# Lookups 
occurrences = ct.defaultdict(int)
ids = ct.defaultdict(list)

for k, v in dd.items():
    key = v[0]
    occurrences[key] += len(v)
    ids[key].append(k)

# View data
for k, v in occurrences.items():
    print("{} instances of {}".format(v, k))
    print("IDs:", ids[k])
    print()

Output

1 instances of 321 Test
IDs: ['240895']

2 instances of 567 Test
IDs: ['21076377', '210763']

3 instances of 123 Test
IDs: ['13325942', '133259']

Your question is a little confusing; I don't quite see what the criteria for generating the IDs are. Here I am just showing the logic rather than an exact solution, which you can adapt to your needs:

track = {}
with open('file.txt') as f:
    for line_no, line in enumerate(f):
        key = line.split()[0]              # group lines by their first token
        if key not in track:
            track[key] = [['ID', 'your_unique_id']]
        else:
            # put your own logic here for what to append when the ID is a duplicate
            track[key].append(['ID', 'duplicate_id' + str(line_no)])

print(track)

output:

{'123': [['ID', 'your_unique_id'], ['ID', 'duplicate_id1'], ['ID', 'duplicate_id2']], '321': [['ID', 'your_unique_id']], '567': [['ID', 'your_unique_id'], ['ID', 'duplicate_id5']]}
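
Building on the same logic, here is one possible way to plug in the 6- and 8-character IDs from the question, using uuid4 hex strings as in the original post. It is only a sketch of one interpretation (every duplicate of a line shares the same two extra characters as the other duplicates):

import uuid

base_ids = {}   # maps each distinct line to an 8-character uuid4 prefix
ids = []        # one ID per input line, in file order

with open('test.txt') as f:
    for line in f:
        line = line.strip()
        if line not in base_ids:
            # first occurrence: remember an 8-character ID, emit its 6-character prefix
            base_ids[line] = uuid.uuid4().hex.upper()[:8]
            ids.append(base_ids[line][:6])
        else:
            # duplicate: the same ID as the original plus two extra characters
            ids.append(base_ids[line])

print(ids)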
