Generate unique IDs for a list of strings with duplicates

I want to generate IDs for strings that are being read from a text file. If the strings are duplicates, I want the first instance of the string to have an ID containing 6 characters. For the duplicates of that string, I want the ID to be the same as the original one, but with an additional two characters. I'm having trouble with the logic. Here's what I've done so far:

from itertools import groupby
import uuid
f = open('test.txt', 'r')
addresses = f.readlines()

list_of_addresses = ['Address']
list_of_ids = ['ID']


for x in addresses:
    list_of_addresses.append(x)


def find_duplicates():

    for x, y in groupby(sorted(list_of_addresses)):
        id = str(uuid.uuid4().get_hex().upper()[0:6])
        j = len(list(y))
        if j > 1:
            print str(j) + " instances of " + x
            list_of_ids.append(id)
        print list_of_ids

find_duplicates()

How should I approach this?

Edit: here are the contents of test.txt:

123 Test
123 Test
123 Test
321 Test
567 Test
567 Test

And the output:

3 occurences of 123 Test

['ID', 'C10DD8']
['ID', 'C10DD8']
2 occurences of 567 Test

['ID', 'C10DD8', '595C5E']
['ID', 'C10DD8', '595C5E']

If the strings are duplicates, I want the first instance of the string to have an ID containing 6 characters. For the duplicates of that string, I want the ID to be the same as the original one, but with an additional two characters.

Try using a collections.defaultdict.

Given

import ctypes
import collections as ct


filename = "test.txt"


def read_file(fname):
    """Read lines from a file."""
    with open(fname, "r") as f:
        for line in f:
            yield line.strip()

Code

dd = ct.defaultdict(list)
for x in read_file(filename):
    key = str(ctypes.c_size_t(hash(x)).value)      # make positive hashes
    if key[:6] not in dd:
        dd[key[:6]].append(x)
    else:
        dd[key[:8]].append(x)

dd

Output

defaultdict(list,
            {'133259': ['123 Test'],
             '13325942': ['123 Test', '123 Test'],
             '210763': ['567 Test'],
             '21076377': ['567 Test'],
             '240895': ['321 Test']})

The resulting dictionary has a key of length 6 for the first occurrence of each unique line. For every subsequent repeat of that line, two additional characters are sliced for the key, so all repeats of a line share a single 8-character key.

You can implement the keys however you wish. In this case, we used hash() to correlate the key to each unique line. We then sliced the desired sequence from the key. See also a post on making positive hash values with ctypes.
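As one alternative way to build the keys, here is a minimal sketch that swaps hash() for a hashlib digest. The motivation is that Python 3 randomizes string hashes per process by default, so keys derived from hash() can change from run to run, while a digest of the text stays the same. The helper names stable_key and group_by_key are hypothetical, and the sketch assumes that no two distinct lines share the same first 6 hex characters; if they did, the second line would be treated as a repeat of the first.

```python
import hashlib
from collections import defaultdict


def stable_key(line, length=8):
    """Return the first `length` hex characters of the line's SHA-1 digest."""
    digest = hashlib.sha1(line.encode("utf-8")).hexdigest().upper()
    return digest[:length]


def group_by_key(lines):
    """Group lines under a 6-character key (first occurrence)
    or an 8-character key (every later repeat)."""
    groups = defaultdict(list)
    for line in lines:
        key = stable_key(line)
        if key[:6] not in groups:
            # First time this line is seen: use the short key.
            groups[key[:6]].append(line)
        else:
            # Repeat: same prefix plus two additional characters.
            groups[key[:8]].append(line)
    return groups


# Same contents as test.txt in the question.
lines = ["123 Test", "123 Test", "123 Test", "321 Test", "567 Test", "567 Test"]
groups = group_by_key(lines)
for key, members in groups.items():
    print(key, members)
```

The structure of the result mirrors the defaultdict above: one short key per unique line and one long key collecting its repeats. Only the key values differ, since they now come from the digest.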


To inspect your results, create the appropriate lookup dictionaries from the defaultdict.

# Lookups 
occurrences = ct.defaultdict(int)
ids = ct.defaultdict(list)

for k, v in dd.items():
    key = v[0]
    occurrences[key] += len(v)
    ids[key].append(k)

# View data
for k, v in occurrences.items():
    print("{} instances of {}".format(v, k))
    print("IDs:", ids[k])
    print()

Output

1 instances of 321 Test
IDs: ['240895']

2 instances of 567 Test
IDs: ['21076377', '210763']

3 instances of 123 Test
IDs: ['13325942', '133259']

Your question is a little confusing; I don't understand the criteria for generating the ID. Here I am only showing the logic, not an exact solution, which you can adapt:

track = {}
with open('file.txt') as f:
    for line_no, line in enumerate(f):
        key = line.split()[0]  # first whitespace-separated token of the line
        if key not in track:
            track[key] = [['ID', 'your_unique_id']]
        else:
            # put your own logic here for the ID of a duplicate
            track[key].append(['ID', 'duplicate_id' + str(line_no)])

print(track)

Output:

{'123': [['ID', 'your_unique_id'], ['ID', 'duplicate_id1'], ['ID', 'duplicate_id2']], '321': [['ID', 'your_unique_id']], '567': [['ID', 'your_unique_id'], ['ID', 'duplicate_id5']]}
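To make the placeholder logic above concrete, here is one possible sketch of the scheme described in the question: the first occurrence of a line gets a 6-character ID, and each repeat gets that same ID plus two additional characters. The function name assign_ids and the choice of a two-digit counter as the suffix are assumptions, not something the question specifies. The sketch keys on the whole line rather than its first token, and the suffix only stays at two characters for up to 99 repeats of a line.

```python
import uuid


def assign_ids(lines):
    """Return (line, id) pairs in input order.

    First occurrence: random 6-character hex ID.
    Each repeat: the same 6 characters plus a two-digit counter.
    """
    base_ids = {}   # line -> its 6-character base ID
    repeats = {}    # line -> number of repeats seen so far
    used = set()    # base IDs already handed out
    result = []
    for line in lines:
        if line not in base_ids:
            new_id = uuid.uuid4().hex[:6].upper()
            # Truncated UUIDs can collide, so retry until the ID is unused.
            while new_id in used:
                new_id = uuid.uuid4().hex[:6].upper()
            used.add(new_id)
            base_ids[line] = new_id
            repeats[line] = 0
            result.append((line, new_id))
        else:
            repeats[line] += 1
            suffix = "{:02d}".format(repeats[line])
            result.append((line, base_ids[line] + suffix))
    return result


# Same contents as test.txt in the question, already stripped of newlines.
lines = ["123 Test", "123 Test", "123 Test", "321 Test", "567 Test", "567 Test"]
result = assign_ids(lines)
for line, line_id in result:
    print(line_id, line)
```

When reading from a file instead of a list, stripping each line first (for example with line.strip()) keeps a trailing newline from making otherwise identical lines look different.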
