How to select an item from a list with known percentages in Python

I wish to select a random word from a list where there is a known chance for each word, for example:

Fruit    Probability
Orange   0.10
Apple    0.05
Mango    0.15
etc.

What would be the best way of implementing this? The actual list I will draw from is up to 100 items long, and the percentages do not add up to 100%: they fall short to account for the items that had a really low chance of occurrence. Ideally I would like to read this from a CSV file, which is where I store the data. This is not a time-critical task.

Thank you for any advice on how best to proceed.

You can pick items with weighted probabilities if you assign each item a number range proportional to its probability, pick a random number between zero and the sum of the ranges, and find which item's range contains it. The following class does exactly that:

from random import random

class WeightedChoice(object):
    def __init__(self, weights):
        """Pick items with weighted probabilities.

            weights
                a sequence of tuples of an item and its weight.
        """
        self._total_weight = 0.
        self._item_levels = []
        for item, weight in weights:
            self._total_weight += weight
            self._item_levels.append((self._total_weight, item))

    def pick(self):
        pick = self._total_weight * random()
        for level, item in self._item_levels:
            if level >= pick:
                return item

You can then load the CSV file with the csv module and feed it to the WeightedChoice class:

import csv

# Each CSV row is expected to hold an item and its weight.
with open('file.csv', newline='') as f:
    weighted_items = [(item, float(weight)) for item, weight in csv.reader(f)]
picker = WeightedChoice(weighted_items)
print(picker.pick())
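
On Python 3.6 and later, the standard library's random.choices accepts a weights argument and does the same cumulative lookup internally. The following is only a minimal sketch: SAMPLE_CSV and load_weights are made-up names standing in for the real file and its loader, and it assumes each row holds exactly an item and a weight.

```python
import csv
import io
import random

# Hypothetical CSV content standing in for file.csv: one "item,weight" row per line.
SAMPLE_CSV = "Orange,0.10\nApple,0.05\nMango,0.15\n"

def load_weights(fileobj):
    """Return parallel lists of items and float weights read from a CSV file object."""
    items, weights = [], []
    for item, weight in csv.reader(fileobj):
        items.append(item)
        weights.append(float(weight))
    return items, weights

items, weights = load_weights(io.StringIO(SAMPLE_CSV))

# random.choices treats the weights as relative, so they need not sum to 1.
picked = random.choices(items, weights=weights, k=1)[0]
print(picked)
```

Like WeightedChoice, this scales by the total weight, so any shortfall below 100% is spread proportionally over the listed items rather than being reserved for unlisted ones.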

What you want is to draw from a multinomial distribution. Assuming you have two lists, one of items and one of probabilities, and the probabilities sum to 1 (if not, add a default item to cover the remainder):

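import random

def choose(items, chances):
    # Walk the cumulative probabilities until they exceed the random draw.
    x = random.random()
    p = 0.0
    for item, chance in zip(items, chances):
        p += chance
        if x < p:
            return item
    # Reached only through floating-point rounding when the chances sum to 1.
    return items[-1]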
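
The "default value to cover the extra" mentioned above could be added along these lines. This is a sketch under assumptions: pad_with_default, draw and the label "other" are illustrative names, not part of the original answer.

```python
import random

def pad_with_default(items, chances, default="other"):
    """Return copies of both lists with a catch-all entry that absorbs the shortfall."""
    shortfall = 1.0 - sum(chances)
    if shortfall < -1e-9:
        raise ValueError("chances add up to more than 1")
    # Clamp tiny negative rounding errors to zero.
    return list(items) + [default], list(chances) + [max(shortfall, 0.0)]

def draw(items, chances):
    """Walk the running total until it exceeds a uniform draw from [0.0, 1.0)."""
    x = random.random()
    total = 0.0
    for item, chance in zip(items, chances):
        total += chance
        if x < total:
            return item
    return items[-1]  # reached only through floating-point rounding

items, chances = pad_with_default(["Orange", "Apple", "Mango"], [0.10, 0.05, 0.15])
print(items[-1], round(chances[-1], 2))
print(draw(items, chances))
```

With the padding in place, a draw that falls outside the listed fruits returns the catch-all label instead of running off the end of the list.
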
lst = [ ('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.69) ]

x = 0.0
lst2 = []
for fruit, chance in lst:
    tup = (x, fruit)
    lst2.append(tup)
    x += chance

tup = (x, None)
lst2.append(tup)

import random

def pick_one(lst2):
    if lst2[0][1] is None:
        raise ValueError("no valid values to choose")
    while True:
        r = random.random()
        for x, fruit in reversed(lst2):
            if x <= r:
                if fruit is None:
                    break  # try again with a different random value
                else:
                    return fruit

pick_one(lst2)

This builds a new list of ascending values, each marking the start of the range of random values that selects a fruit; pick_one() then walks backward through the list, looking for the first value that is <= the current random value. We put a "sentinel" value on the end of the list; if the chances don't add up to 1.0, a random value can land in the uncovered range, where it matches the sentinel and is rejected, and a new random value is drawn. random.random() returns a value in the range [0.0, 1.0), so it is certain to match something in the list eventually.

The nice thing here is that you can have one value with a 0.000001 chance of matching, and it should actually match with that frequency; the other solutions, where you build a list with the items repeated and use random.choice() to choose one, would require a list with a million items in it to handle this case.
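
The same threshold lookup can also be sketched with the standard library's bisect module, which replaces the linear walk with a binary search. This is an illustrative variant, not the code above: pick_one_bisect and the rng parameter are assumed names, and the list stores each range's upper bound rather than its start, so no explicit sentinel entry is needed.

```python
import bisect
import itertools
import random

fruits = ["Orange", "Apple", "Mango", "etc"]
chances = [0.10, 0.05, 0.15, 0.69]

# Running totals: the upper bound of each fruit's slice of [0.0, 1.0).
uppers = list(itertools.accumulate(chances))

def pick_one_bisect(fruits, uppers, rng=random):
    """Pick a fruit by binary search; redraw when r lands in the uncovered gap."""
    if not fruits or uppers[-1] <= 0.0:
        raise ValueError("no valid values to choose")
    while True:
        r = rng.random()
        # Index of the first slice whose upper bound is greater than r.
        i = bisect.bisect_right(uppers, r)
        if i < len(fruits):
            return fruits[i]
        # r fell above the last bound (the missing 1%); try again.

print(pick_one_bisect(fruits, uppers))
```

For a list of around 100 items the speed difference is negligible, so this is mainly a matter of taste.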

lst = [ ('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.69) ]

x = 0.0
lst2 = []
for fruit, chance in lst:
    low = x
    high = x + chance
    tup = (low, high, fruit)
    lst2.append(tup)
    x += chance

if x > 1.0:
    raise ValueError("chances add up to more than 100%")

low = x
high = 1.0
tup = (low, high, None)
lst2.append(tup)

import random

def pick_one(lst2):
    if lst2[0][2] is None:
        raise ValueError("no valid values to choose")
    while True:
        r = random.random()
        for low, high, fruit in lst2:
            if low <= r < high:
                if fruit is None:
                    break  # try again with a different random value
                else:
                    return fruit

pick_one(lst2)


# test it 10,000 times and count how often each fruit is picked
d = {}
for i in range(10000):
    x = pick_one(lst2)
    d[x] = d.get(x, 0) + 1
print(d)

I think this is a little clearer. Instead of the tricky representation of ranges as ascending values, we just keep the ranges themselves. Because we are testing ranges, we can simply walk forward through the lst2 values; there is no need to use reversed().

from numpy.random import multinomial
import numpy as np

def pickone(dist):
    return np.where(multinomial(1, dist) == 1)[0][0]

if __name__ == '__main__':
    lst = [ ('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.70) ]
    dist = [p[1] for p in lst]
    
    N = 10000
    draws = np.array([pickone(dist) for i in range(N)], dtype=int)
    hist = np.histogram(draws, bins=[i for i in range(len(dist)+1)])[0]
    for i in range(len(lst)):
        print(f'{lst[i]} {hist[i]/N}')
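
NumPy can also draw all the samples in one call. The sketch below assumes a NumPy version that provides numpy.random.default_rng (1.17 or later); the seed is arbitrary and only makes the run repeatable. Note that the p argument must sum to 1, so a shortfall has to be assigned to some entry first, as 'etc' does here.

```python
import numpy as np

lst = [("Orange", 0.10), ("Apple", 0.05), ("Mango", 0.15), ("etc", 0.70)]
names = [name for name, _ in lst]
probs = np.array([p for _, p in lst])

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only

# Draw 10,000 indices into names, each with its own probability.
draws = rng.choice(len(names), size=10_000, p=probs)

# Count how often each index came up and print the observed frequencies.
counts = np.bincount(draws, minlength=len(names))
for name, count in zip(names, counts):
    print(name, count / draws.size)
```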

One solution is to normalize the probabilities to integers and then repeat each element that many times (e.g. a list with 2 Oranges, 1 Apple, 3 Mangos). This is incredibly easy to do (from random import choice). If that is not practical, try the code here.

import random
d = {'orange': 0.10, 'mango': 0.15, 'apple': 0.05}
weightedArray = []
for k in d:
    # round() avoids truncation errors such as int(0.29 * 100) == 28
    weightedArray += [k] * round(d[k] * 100)
print(random.choice(weightedArray))

EDITS

This is essentially what Brian said above.
