Assign a number to each unique value in a list

I have a list of strings. I want to assign a unique number to each string (the exact number is not important), and create a list of the same length using these numbers, in order. Below is my best attempt, but I am not happy with it for two reasons:

  1. It assumes that equal values are next to each other

  2. I had to start the list with a 0, otherwise the output would be incorrect

My code:

names = ['ll', 'll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'HL', 'HL']
numbers = [0]
num = 0
for item in range(len(names)):
    if item == len(names) - 1:
        break
    elif names[item] == names[item+1]:
        numbers.append(num)
    else:
        num = num + 1
        numbers.append(num)
print(numbers)

I want to make the code more generic, so that it works with an unknown list. Any ideas?

Without using an external library (see the EDIT for a Pandas solution), you can do it as follows:

d = {ni: indi for indi, ni in enumerate(set(names))}
numbers = [d[ni] for ni in names]

Brief explanation:

In the first line, you assign a number to each unique element of your list and store the result in the dictionary d. A dictionary comprehension builds it easily, and set returns the unique elements of names.

Then, in the second line, a list comprehension looks up each name and stores the actual numbers in the list numbers.

One example to illustrate that it also works fine for unsorted lists:

# 'll' appears all over the place
names = ['ll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'll', 'LL', 'HL', 'HL', 'HL', 'll']

This is the output for numbers (the exact values can differ between runs, because the order of a set is arbitrary):

[1, 1, 3, 3, 3, 2, 2, 1, 2, 0, 0, 0, 1]

As you can see, the number 1 associated with ll appears at the correct places.
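
Whatever numbers a particular run produces, the mapping should stay consistent. A minimal sketch of that check, reusing the names list from the example (the variable k is introduced here only for illustration):

```python
names = ['ll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'll', 'LL', 'HL', 'HL', 'HL', 'll']

# Same approach as above: number the unique values, then look each name up.
d = {ni: indi for indi, ni in enumerate(set(names))}
numbers = [d[ni] for ni in names]

k = len(d)  # number of distinct strings (4 here)

# Every distinct string gets its own number in 0..k-1 ...
assert sorted(d.values()) == list(range(k))
# ... and two items share a number exactly when they are the same string.
assert all((a == b) == (x == y)
           for a, x in zip(names, numbers)
           for b, y in zip(names, numbers))
```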

EDIT

If you have Pandas available, you can also use pandas.factorize (which seems to be quite efficient for huge lists and also works fine for lists of tuples, as explained here):

import pandas as pd

pd.factorize(names)

will then return

(array([0, 0, 1, 1, 1, 2, 2, 0, 2, 3, 3, 3, 0]),
 array(['ll', 'hl', 'LL', 'HL'], dtype=object))

Therefore,

numbers = pd.factorize(names)[0]

If the condition is that the numbers are unique and the exact number is not important, then you can build a mapping relating each item in the list to a unique number on the fly, assigning values from a count object:

from itertools import count

names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']

d = {}
c = count()
numbers = [d.setdefault(i, next(c)) for i in names]
print(numbers)
# [0, 0, 2, 2, 4, 4, 4, 7, 0]
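
The gaps in that output (0, 2, 4, 7) come from next(c) being evaluated on every iteration, even when the key already exists, because arguments are evaluated before setdefault is called. If gap-free numbers are preferred, one possible variation (a sketch, not part of the answer above) advances the counter only for unseen names:

```python
from itertools import count

names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']

d = {}
c = count()
numbers = []
for name in names:
    if name not in d:
        d[name] = next(c)  # the counter advances only for a new name
    numbers.append(d[name])

print(numbers)
# [0, 0, 1, 1, 2, 2, 2, 3, 0]
```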

You could do away with the extra names by using map on the list and a count object, and setting the map function to {}.setdefault (see @StefanPochmann's comment):

from itertools import count

names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']
numbers = list(map({}.setdefault, names, count()))  # list() is needed on Python 3, where map is lazy
print(numbers)
# [0, 0, 2, 2, 4, 4, 4, 7, 0]

As an extra, you could also use np.unique, in case you already have numpy installed:

import numpy as np

_, numbers = np.unique(names, return_inverse=True)
print(numbers)
# [3 3 2 2 1 1 1 0 3]

If you have k different values, this maps them to the integers 0 to k-1 in order of first appearance:

>>> names = ['b', 'c', 'd', 'c', 'b', 'a', 'b']
>>> tmp = {}
>>> [tmp.setdefault(name, len(tmp)) for name in names]
[0, 1, 2, 1, 0, 3, 0]
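
Since dicts keep insertion order (guaranteed from Python 3.7), the keys of tmp also serve as a lookup table from number back to name. A short sketch, where uniques and decoded are names introduced only for this example:

```python
names = ['b', 'c', 'd', 'c', 'b', 'a', 'b']
tmp = {}
numbers = [tmp.setdefault(name, len(tmp)) for name in names]

# Keys come out in the order they were first inserted, so index i holds
# the name that was assigned number i.
uniques = list(tmp)
print(uniques)
# ['b', 'c', 'd', 'a']

# Round trip: numbers -> names
decoded = [uniques[n] for n in numbers]
assert decoded == names
```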

To make it more generic you can wrap it in a function, so these hard-coded values don't do any harm, because they are local.

If you use efficient lookup containers (I'll use a plain dictionary), you can keep the first index of each string without losing too much performance:

def your_function(list_of_strings):

    encountered_strings = {}
    result = []

    idx = 0
    for astring in list_of_strings:
        if astring in encountered_strings:  # have we already seen this string?
            result.append(encountered_strings[astring])
        else:
            encountered_strings[astring] = idx
            result.append(idx)
            idx += 1
    return result

And this will assign the indices in order (even if that's not important):

>>> your_function(['ll', 'll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'HL', 'HL'])
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

This needs only one iteration over your list of strings, which makes it possible to process even generators and similar iterables.

I managed to modify your script very slightly and it looks OK:
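
To illustrate the point about generators, here is a lazy variant sketched along the same lines; the name iter_codes is hypothetical and not part of the answer above:

```python
def iter_codes(strings):
    """Yield the number of each string lazily, one item at a time."""
    encountered = {}
    for s in strings:
        # len(encountered) is the next unused index when s is new
        yield encountered.setdefault(s, len(encountered))


# Works on a generator without materializing the whole input first.
words = (w.strip() for w in ['ll ', 'hl', ' ll', 'HL'])
print(list(iter_codes(words)))
# [0, 1, 0, 2]
```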

names = ['ll', 'hl', 'll', 'hl', 'LL', 'll', 'LL', 'HL', 'hl', 'HL', 'LL', 'HL', 'zzz']
names.sort()
print(names)
numbers = []
num = 0
for item in range(len(names)):
    if item == len(names) - 1:
        break
    elif names[item] == names[item+1]:
        numbers.append(num)
    else:
        numbers.append(num)
        num = num + 1
numbers.append(num)
print(numbers)

You can see it is very similar. The only difference is that instead of adding the number for the NEXT element, I add the number for the CURRENT element. That's all. Oh, and sorting. In this example it sorts capitals first, then lowercase; you can play with sort(key=lambda x: ...) if you wish to change that. (Perhaps like this: names.sort(key = lambda x: (x.upper() if x.lower() == x else x.lower())) )

Since you are mapping strings to integers, that suggests using a dict. So you can do the following:
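
One thing to watch for: names.sort() reorders the list in place, so the numbers line up with the sorted list rather than with the original order. If the original order matters, a possible adjustment (a sketch, assuming numbering by sorted order is still wanted; rank is a name made up for this example) is to number the sorted unique values and then look up each original item:

```python
names = ['ll', 'hl', 'll', 'hl', 'LL', 'll', 'LL', 'HL', 'hl', 'HL', 'LL', 'HL', 'zzz']

# Number the distinct names in sorted order (capitals sort before lowercase).
rank = {name: num for num, name in enumerate(sorted(set(names)))}

# Look up each item in its original position; names itself is left untouched.
numbers = [rank[name] for name in names]
print(numbers)
# [3, 2, 3, 2, 1, 3, 1, 0, 2, 0, 1, 0, 4]
```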

d = dict()

counter = 0

for name in names:
    if name in d:
        continue
    d[name] = counter
    counter += 1

numbers = [d[name] for name in names]

Here is a similar factorizing solution with collections.defaultdict and itertools.count:

import itertools as it
import collections as ct


names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']

dd = ct.defaultdict(it.count().__next__)
[dd[i] for i in names]
# [0, 0, 1, 1, 2, 2, 2, 3, 0]

Every new occurrence calls for the next integer from itertools.count and adds a new entry to dd.

Pandas' factorize can simply factorize unique strings:
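
One caveat worth keeping in mind with this approach: merely looking up an unseen key in a defaultdict inserts it and consumes the next number. A small sketch of that behavior, and of freezing the mapping with dict() once the numbering is done (mapping is a name introduced for this example):

```python
import collections as ct
import itertools as it

names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']

dd = ct.defaultdict(it.count().__next__)
numbers = [dd[i] for i in names]

# Convert to a plain dict so later lookups cannot add new entries.
mapping = dict(dd)
print(mapping)
# {'ll': 0, 'hl': 1, 'LL': 2, 'HL': 3}

print(dd['new'])         # 4: the defaultdict silently assigns a new number
print('new' in mapping)  # False: the plain dict is unaffected
```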

import pandas as pd

codes, uniques = pd.factorize(names)
codes
# array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  (names from the question; default sort=False numbers by first appearance)

This can also be done in Scikit-learn with LabelEncoder():

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
codes = le.fit_transform(names)
codes
# array([3, 3, 3, 2, 2, 2, 1, 1, 1, 0, 0, 0])

You can also try this:

names = ['ll', 'll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'HL', 'HL']

indexList = list(set(names))

print(list(map(lambda name: indexList.index(name), names)))
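
Note that indexList.index(name) scans the list on every call, so the cost grows with the number of distinct values. A possible alternative, sketched here with a dict for constant-time lookups (the name position is introduced only for this example):

```python
names = ['ll', 'll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'HL', 'HL']

indexList = list(set(names))

# Precompute each name's position once instead of calling index() per item.
position = {name: i for i, name in enumerate(indexList)}

numbers = [position[name] for name in names]

# Same result as the index()-based version.
assert numbers == [indexList.index(name) for name in names]
```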
