
What is the difference between random.sample and random.shuffle in Python

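I have a list a_tot with 1500 elements, and I would like to split it randomly into two lists. List a_1 will have 1300 elements and list a_2 will have 200 elements. My question is about the best way to randomize the original 1500-element list. Once the list is randomized, I can take one slice of 1300 and another slice of 200. One way is to use random.shuffle, another is to use random.sample. Is there any difference in the quality of the randomization between the two methods? The data in list 1 should be a random sample, and so should the data in list 2. Any recommendations?

Using shuffle: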

random.shuffle(a_tot)    #get a randomized list
a_1 = a_tot[0:1300]     #pick the first 1300
a_2 = a_tot[1300:]      #pick the last 200

Using sample:

new_t = random.sample(a_tot,len(a_tot))    #get a randomized list
a_1 = new_t[0:1300]     #pick the first 1300
a_2 = new_t[1300:]      #pick the last 200

Source of shuffle:

def shuffle(self, x, random=None, int=int):
    """x, random=random.random -> shuffle list x in place; return None.

    Optional arg random is a 0-argument function returning a random
    float in [0.0, 1.0); by default, the standard random.random.
    """

    if random is None:
        random = self.random
    for i in reversed(xrange(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random() * (i+1))
        x[i], x[j] = x[j], x[i]

Source of sample:

def sample(self, population, k):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use xrange as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(xrange(10000000), 60)
    """

    # XXX Although the documentation says `population` is "a sequence",
    # XXX attempts are made to cater to any iterable with a __len__
    # XXX method.  This has had mixed success.  Examples from both
    # XXX sides:  sets work fine, and should become officially supported;
    # XXX dicts are much harder, and have failed in various subtle
    # XXX ways across attempts.  Support for mapping types should probably
    # XXX be dropped (and users should pass mapping.keys() or .values()
    # XXX explicitly).

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    n = len(population)
    if not 0 <= k <= n:
        raise ValueError, "sample larger than population"
    random = self.random
    _int = int
    result = [None] * k
    setsize = 21        # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize or hasattr(population, "keys"):
        # An n-length list is smaller than a k-length set, or this is a
        # mapping type so the other algorithm wouldn't work.
        pool = list(population)
        for i in xrange(k):         # invariant:  non-selected at [0,n-i)
            j = _int(random() * (n-i))
            result[i] = pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        try:
            selected = set()
            selected_add = selected.add
            for i in xrange(k):
                j = _int(random() * n)
                while j in selected:
                    j = _int(random() * n)
                selected_add(j)
                result[i] = population[j]
        except (TypeError, KeyError):   # handle (at least) sets
            if isinstance(population, list):
                raise
            return self.sample(tuple(population), k)
    return result

As you can see, in both cases the randomization is essentially done by the line int(random() * n). So the underlying algorithm is essentially the same.

random.shuffle() shuffles the given list in place. Its length stays the same.

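random.sample() picks n items from the given sequence without replacement (the sequence can also be a tuple or anything else, as long as it has a __len__()) and returns them in random order.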
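The contrast between the two calls can be checked directly. The following is a minimal sketch, assuming Python 3; the variable names and the seeded `random.Random(0)` generator are illustrative choices, used only so that the run is repeatable.

```python
import random

rng = random.Random(0)  # seeded generator (an assumption, for repeatability)

items = list(range(10))
original = list(items)            # untouched copy for comparison

ret = rng.shuffle(items)          # reorders items in place and returns None
picked = rng.sample(original, 3)  # returns a new 3-element list
from_tuple = rng.sample(tuple(original), 3)  # an immutable input works too

print(ret)                        # None
print(len(items), len(picked))    # 10 3
```

After this runs, `items` holds the same ten values in a new order, while `original` is left exactly as it was.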

There are two major differences between shuffle() and sample():

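1) Shuffle alters the data in place, so its input must be a mutable sequence. In contrast, sample produces a new list, and its input can be much more varied (tuple, string, xrange, bytearray, set, etc.).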

2) Sample lets you potentially do less work (i.e. a partial shuffle).
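The "less work" point can be illustrated with the large-population case mentioned in the sample() docstring. This is a sketch assuming Python 3, where `range` takes the place of `xrange`; the seed and the variable name `winners` are illustrative.

```python
import random

rng = random.Random(1)  # seeded generator (an assumption, for repeatability)

# Draw 60 distinct values from ten million candidates without building
# or shuffling a ten-million-element list.
winners = rng.sample(range(10_000_000), 60)

print(len(winners))        # 60
print(len(set(winners)))   # 60, since sampling is without replacement
```

Shuffling the whole population and slicing off the first 60 entries would give an equally valid sample, but it would need a full list in memory and a swap for nearly every element.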

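It is interesting to show the conceptual relationship between the two by demonstrating that shuffle() could be implemented in terms of sample():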

def shuffle(p):
   p[:] = sample(p, len(p))

Or vice versa, implementing sample() in terms of shuffle():

def sample(p, k):
   p = list(p)
   shuffle(p)
   return p[:k]

Neither of these is as efficient as the real implementations of shuffle() and sample(), but they do show the conceptual relationship.

I think they are quite the same, except that one updates the original list and the other only reads it. There is no difference in quality.
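One informal way to look at the "no difference in quality" claim is to count how often each ordering of a tiny list comes up under both methods. The sketch below assumes Python 3; the seed, the trial count, and the three-element list are arbitrary illustrative choices, and an experiment like this is only a sanity check, not a proof.

```python
import random
from collections import Counter

rng = random.Random(12345)  # seeded generator (an assumption, for repeatability)
trials = 60_000
base = [0, 1, 2]            # 3 elements, so 6 possible orderings

shuffle_counts = Counter()
sample_counts = Counter()
for _ in range(trials):
    work = list(base)
    rng.shuffle(work)                       # in-place permutation of a copy
    shuffle_counts[tuple(work)] += 1
    sample_counts[tuple(rng.sample(base, len(base)))] += 1

# Each ordering should appear in roughly trials / 6 of the runs.
for perm in sorted(shuffle_counts):
    print(perm, shuffle_counts[perm], sample_counts[perm])
```

With both methods, every one of the six orderings should land close to 10,000 occurrences.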

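The randomization should be just as good with both options. I'd say go with shuffle, because it is more immediately clear to the reader what it does.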
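For the 1300/200 split in the question, the shuffle approach could be wrapped up as follows. This is a sketch assuming Python 3; `split_random` is a hypothetical helper name that does not exist in the random module, and the stand-in data and seed are illustrative.

```python
import random

def split_random(items, first_size, rng=random):
    """Return two new lists: a random first_size-element part and the rest.

    Hypothetical helper, not part of the standard library.
    """
    shuffled = list(items)   # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    return shuffled[:first_size], shuffled[first_size:]

a_tot = list(range(1500))    # stand-in data for the 1500 elements
a_1, a_2 = split_random(a_tot, 1300, random.Random(42))
print(len(a_1), len(a_2))    # 1300 200
```

Copying before shuffling is a design choice: it keeps a_tot intact, at the cost of one extra list. If the original order is no longer needed, shuffling a_tot directly as in the question works just as well.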

from random import shuffle, sample

x = [[i] for i in range(10)]
shuffle(x)       # reorders x in place and returns None
sample(x, 10)    # returns a new list in random order; x is left unchanged

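shuffle updates the list in place, whereas sample returns a new list. sample also takes an argument for how many items to pick, while shuffle always gives a list of the same length as the input.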

