
Python: Rename duplicates in list with progressive numbers without sorting list

Given a list like this:

mylist = ["name", "state", "name", "city", "name", "zip", "zip"]

I would like to rename the duplicates by appending a number to get the following result:

mylist = ["name1", "state", "name2", "city", "name3", "zip1", "zip2"]

I do not want to change the order of the original list. The solutions suggested for this related Stack Overflow question sort the list, which I do not want to do.

My solution with map and lambda:

print(list(map(lambda x: x[1] + str(mylist[:x[0]].count(x[1]) + 1) if mylist.count(x[1]) > 1 else x[1], enumerate(mylist))))

A more traditional form:

newlist = []
for i, v in enumerate(mylist):
    totalcount = mylist.count(v)
    count = mylist[:i].count(v)
    newlist.append(v + str(count + 1) if totalcount > 1 else v)

And the last one, as a list comprehension:

[v + str(mylist[:i].count(v) + 1) if mylist.count(v) > 1 else v for i, v in enumerate(mylist)]

This is how I would do it. EDIT: I wrote this into a more generalized utility function since people seem to like this answer.

mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
check = ["name1", "state", "name2", "city", "name3", "zip1", "zip2"]
copy = mylist[:]  # so we will only mutate the copy in case of failure

from collections import Counter # Counter counts the number of occurrences of each item
from itertools import tee, count

def uniquify(seq, suffs=None):
    """Make all the items unique by adding a suffix (1, 2, etc).

    `seq` is a mutable sequence of strings.
    `suffs` is an optional alternative suffix iterable.
    """
    if suffs is None:
        # build the default per call; a count(1) default argument would be
        # shared, and partly consumed, across calls
        suffs = count(1)
    not_unique = [k for k, v in Counter(seq).items() if v > 1]  # so we have: ['name', 'zip']
    # suffix generator dict - e.g., {'name': <my_gen>, 'zip': <my_gen>}
    suff_gens = dict(zip(not_unique, tee(suffs, len(not_unique))))
    for idx,s in enumerate(seq):
        try:
            suffix = str(next(suff_gens[s]))
        except KeyError:
            # s was unique
            continue
        else:
            seq[idx] += suffix

uniquify(copy)
assert copy==check  # raise an error if we failed
mylist = copy  # success

If you wanted to append an underscore before each count, you could do something like this:

>>> mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
>>> uniquify(mylist, (f'_{x!s}' for x in range(1, 100)))
>>> mylist
['name_1', 'state', 'name_2', 'city', 'name_3', 'zip_1', 'zip_2']

...or if you wanted to use letters instead:

>>> mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
>>> import string
>>> uniquify(mylist, (f'_{x!s}' for x in string.ascii_lowercase))
>>> mylist
['name_a', 'state', 'name_b', 'city', 'name_c', 'zip_a', 'zip_b']
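Both variations work because uniquify hands each duplicated key its own copy of the suffix iterable through itertools.tee. A minimal sketch of that behavior in isolation (the variable names here are illustrative and not part of the answer):

```python
from itertools import count, tee

# tee() splits one underlying iterator into independent copies,
# so each duplicated key can restart its numbering from 1.
name_suffixes, zip_suffixes = tee(count(1), 2)

first_three_for_name = [next(name_suffixes) for _ in range(3)]
first_two_for_zip = [next(zip_suffixes) for _ in range(2)]

print(first_three_for_name)  # [1, 2, 3]
print(first_two_for_zip)     # [1, 2]
```

Advancing one copy does not advance the other, which is why 'zip' starts again at 1 after 'name' has already used 1 through 3.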

NOTE: this is not the fastest possible algorithm; for that, refer to the answer by ronakg. The advantage of the function above is that it is easy to read and understand, and you're not going to see much of a performance difference unless you have an extremely large list.

EDIT: Here is my original answer as a one-liner. However, the order is not preserved, and it uses the .index method, which is extremely suboptimal (as explained in the answer by DTing). See the answer by queezz for a nice 'two-liner' that preserves order.

[s + str(suffix) if num>1 else s for s,num in Counter(mylist).items() for suffix in range(1, num+1)]
# Produces (Python 3.7+, where Counter keeps insertion order):
# ['name1', 'name2', 'name3', 'state', 'city', 'zip1', 'zip2']

Any method where count is called on each element is going to result in O(n^2), since count is O(n). You can do something like this:

# not modifying original list
from collections import Counter

mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
counts = {k:v for k,v in Counter(mylist).items() if v > 1}
newlist = mylist[:]

for i in reversed(range(len(mylist))):
    item = mylist[i]
    if item in counts and counts[item]:
        newlist[i] += str(counts[item])
        counts[item]-=1
print(newlist)

# ['name1', 'state', 'name2', 'city', 'name3', 'zip1', 'zip2']

# modifying original list
from collections import Counter

mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
counts = {k:v for k,v in Counter(mylist).items() if v > 1}      

for i in reversed(range(len(mylist))):
    item = mylist[i]
    if item in counts and counts[item]:
        mylist[i] += str(counts[item])
        counts[item]-=1
print(mylist)

# ['name1', 'state', 'name2', 'city', 'name3', 'zip1', 'zip2']

This should be O(n).

Other provided answers:

mylist.index(s) per element causes O(n^2):

mylist = ["name", "state", "name", "city", "name", "zip", "zip"]

from collections import Counter
counts = Counter(mylist)
for s,num in counts.items():
    if num > 1:
        for suffix in range(1, num + 1):
            mylist[mylist.index(s)] = s + str(suffix) 

count(x[1]) per element causes O(n^2). It is also called multiple times per element, along with list slicing:

print(list(map(lambda x: x[1] + str(mylist[:x[0]].count(x[1]) + 1) if mylist.count(x[1]) > 1 else x[1], enumerate(mylist))))

Benchmarks:

http://nbviewer.ipython.org/gist/dting/c28fb161de7b6287491b
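For a quick local comparison, a rough timing sketch along the following lines may help. It is not the linked benchmark; the function names rename_quadratic and rename_linear and the sample size are assumptions made for illustration, and the absolute timings will vary by machine.

```python
import timeit
from collections import Counter

def rename_quadratic(items):
    # Calls list.count (and slices the list) for every element: O(n^2) overall.
    return [v + str(items[:i].count(v) + 1) if items.count(v) > 1 else v
            for i, v in enumerate(items)]

def rename_linear(items):
    # One Counter pass for totals, then one pass with a running count: O(n).
    totals = Counter(items)
    seen = Counter()
    result = []
    for v in items:
        if totals[v] > 1:
            seen[v] += 1
            result.append(v + str(seen[v]))
        else:
            result.append(v)
    return result

# Hypothetical sample: the question's list repeated to make the gap visible.
sample = ["name", "state", "name", "city", "name", "zip", "zip"] * 200
assert rename_quadratic(sample) == rename_linear(sample)

for func in (rename_quadratic, rename_linear):
    elapsed = timeit.timeit(lambda: func(sample), number=5)
    print(f"{func.__name__}: {elapsed:.4f} s")
```

Both functions return the same list, so only the running time should differ as the input grows.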

Here's a very simple O(n) solution. Simply walk the list, storing the index of each element's first occurrence. If we've seen the element before, use the data stored earlier to append the occurrence number.

This approach solves the problem by creating just one extra dictionary for look-back. It avoids look-ahead, so we don't create temporary list slices.

mylist = ["name", "state", "name", "city", "city", "name", "zip", "zip", "name"]

dups = {}

for i, val in enumerate(mylist):
    if val not in dups:
        # Store index of first occurrence and occurrence value
        dups[val] = [i, 1]
    else:
        # Special case for first occurrence
        if dups[val][1] == 1:
            mylist[dups[val][0]] += str(dups[val][1])

        # Increment occurrence value, index value doesn't matter anymore
        dups[val][1] += 1

        # Use stored occurrence value
        mylist[i] += str(dups[val][1])

print(mylist)

# ['name1', 'state', 'name2', 'city1', 'city2', 'name3', 'zip1', 'zip2', 'name4']

A list comprehension version of the Rick Teachey answer, a "two-liner":

from collections import Counter

m = ["name", "state", "name", "city", "name", "zip", "zip"]

d = {a:list(range(1, b+1)) if b>1 else '' for a,b in Counter(m).items()}
[i+str(d[i].pop(0)) if len(d[i]) else i for i in m]
#['name1', 'state', 'name2', 'city', 'name3', 'zip1', 'zip2']

You can use a hash table to solve this problem. Define a dictionary d whose key is the string and whose value is (first_time_index_in_the_list, times_of_appearance). Every time you see a word, check the dictionary. If times_of_appearance has just reached 2, use first_time_index_in_the_list to append '1' to the first occurrence, and append times_of_appearance to the current element. If it is greater than 2, just append times_of_appearance to the current element.
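The description above can be sketched roughly as follows. The function name rename_with_first_index is hypothetical, and returning a renamed copy instead of modifying the list in place is an assumption, since the answer does not say which it intends.

```python
def rename_with_first_index(items):
    """Return a renamed copy of `items`; d maps value -> (first_index, count)."""
    result = list(items)
    d = {}
    for i, word in enumerate(items):
        if word not in d:
            # First sighting: remember where it was, leave it unchanged for now.
            d[word] = (i, 1)
            continue
        first_index, times = d[word]
        times += 1
        d[word] = (first_index, times)
        if times == 2:
            # Second sighting: go back and tag the first occurrence with '1'.
            result[first_index] += "1"
        result[i] += str(times)
    return result

print(rename_with_first_index(["name", "state", "name", "city", "name", "zip", "zip"]))
# ['name1', 'state', 'name2', 'city', 'name3', 'zip1', 'zip2']
```

Each element is visited once and every dictionary operation is O(1) on average, so the whole pass is O(n).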

Less fancy stuff.

from collections import defaultdict
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
finalList = []
dictCount = defaultdict(int)
anotherDict = defaultdict(int)
for t in mylist:
   anotherDict[t] += 1
for m in mylist:
   dictCount[m] += 1
   if anotherDict[m] > 1:
       finalList.append(str(m)+str(dictCount[m]))
   else:
       finalList.append(m)
print(finalList)

Beware of updated values that already exist in the original list

If the starting list already includes an item "name2"...

mylist = ["name", "state", "name", "city", "name", "zip", "zip", "name2"]

...then mylist[2] shouldn't be updated to "name2" when the function runs, otherwise a new duplicate will be created; instead, the function should jump to the next available item name, "name3".

mylist_updated = ['name1', 'state', 'name3', 'city', 'name4', 'zip1', 'zip2', 'name2']

Here's an alternate solution (it can probably be shortened and optimized) which includes a recursive function that checks for these existing items in the original list.

mylist = ["name", "state", "name", "city", "name", "zip", "zip", "name2"]

def fix_dups(mylist, sep='', start=1, update_first=True):
    mylist_dups = {}
    #build dictionary containing val: [occurrences, suffix]
    for val in mylist:
        if val not in mylist_dups:
            mylist_dups[val] = [1, start - 1]
        else:
            mylist_dups[val][0] += 1
            
    #define function to update duplicate values with suffix, check if updated value already exists
    def update_val(val, num):
        temp_val = sep.join([str(x) for x in [val, num]])
        if temp_val not in mylist_dups:
            return temp_val, num
        else:
            num += 1
            return update_val(val, num)        
    
    #update list
    for i, val in enumerate(mylist):
        if mylist_dups[val][0] > 1:
            mylist_dups[val][1] += 1  
            if update_first or mylist_dups[val][1] > start:
                new_val, mylist_dups[val][1] = update_val(val, mylist_dups[val][1])
                mylist[i] = new_val

    return mylist
                
mylist_updated = fix_dups(mylist, sep='', start=1, update_first=True)
print(mylist_updated)
#['name1', 'state', 'name3', 'city', 'name4', 'zip1', 'zip2', 'name2']

In case you don't want to change the first occurrence:

mylist = ["name", "state", "name", "city", "name", "zip", "zip", "name_2"]
             
mylist_updated = fix_dups(mylist, sep='_', start=0, update_first=False)
print(mylist_updated)
#['name', 'state', 'name_1', 'city', 'name_3', 'zip', 'zip_1', 'name_2']
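The recursive helper checks candidate names against the values of the original list. As a hedged alternative, the same idea can be written iteratively with a set that also records every name already handed out. The function name rename_avoiding_collisions is hypothetical, and this sketch always renames the first occurrence (it has no update_first or start options).

```python
from collections import Counter

def rename_avoiding_collisions(items, sep=""):
    """Return a copy of `items` with duplicates suffixed, skipping taken names."""
    totals = Counter(items)
    used = set(items)       # every name that already exists or has been issued
    next_suffix = {}        # value -> next number to try for that value
    result = []
    for value in items:
        if totals[value] == 1:
            result.append(value)
            continue
        n = next_suffix.get(value, 1)
        candidate = f"{value}{sep}{n}"
        while candidate in used:
            # Name is taken (e.g. an original "name2"): try the next number.
            n += 1
            candidate = f"{value}{sep}{n}"
        used.add(candidate)
        next_suffix[value] = n + 1
        result.append(candidate)
    return result

mylist = ["name", "state", "name", "city", "name", "zip", "zip", "name2"]
print(rename_avoiding_collisions(mylist))
# ['name1', 'state', 'name3', 'city', 'name4', 'zip1', 'zip2', 'name2']
```

Using a loop instead of recursion avoids hitting the recursion limit when many consecutive candidate names are already taken.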
