Intersection and difference of two sets

Question

Given two sets a and b that both contain integers, I would like to create another set c that contains every integer that is in both a and b and, additionally, each integer that is in a xor b with probability 1/2, e.g.:

a={1,2,3,4}, b={1,2,5}
The result of function(a,b) could be c={1,2,5} or c={1,2,3,4,5} or c={1,2,3,5} or c={1,2,3,4} ....

This is a bottleneck in my code and is done iteratively many times. Currently my code is:

import random

def function(a, b):
    c = a & b
    c_temp = list(a ^ b)

    for x in range(len(c_temp)):
        if random.random() < 0.5:
            c.add(c_temp[x])
    return c

Could this be done faster? Thanks!

Answer 1

I believe so!

Try the code below, which takes the loop out and lets the random module select from the xor set, which I expected to be faster. I used the binomial distribution to determine how many elements should be selected, which is the correct way to do this when each element is considered independently with p=0.5.

#random selection

import numpy as np
import random


def f2(a, b):
    c = a & b
    xor_stuff = a^b
    # random.sample needs a sequence; sets are rejected on Python 3.11+
    xor_selected = random.sample(list(xor_stuff), np.random.binomial(len(xor_stuff), p=0.5))
    c.update(xor_selected)
    return c

a = {1, 2, 3, 4, 5, 6}
b =          {4, 5, 6, 7, 8, 9}

for trial in range(5):
    print(f2(a,b))

Yields:

{3, 4, 5, 6}
{1, 4, 5, 6, 7}
{2, 4, 5, 6, 7, 8, 9}
{1, 2, 4, 5, 6, 9}
{1, 2, 4, 5, 6}
[Finished in 0.2s]
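
As a quick sanity check on the binomial argument, here is a minimal sketch that estimates how often each element ends up in the result. It is stdlib-only, so the count is drawn as the number of 1 bits in n random bits (an assumption swapped in for np.random.binomial; both give a Binomial(n, 0.5) draw). The names pick_binomial and inclusion_rates are hypothetical helpers, not part of the answer.

```python
import random
from collections import Counter

def pick_binomial(a, b, rng):
    """Draw how many xor elements to keep, then sample that many."""
    result = a & b
    only_one = list(a ^ b)  # random.sample needs a sequence on Python 3.11+
    n = len(only_one)
    # The number of 1 bits among n fair random bits is a Binomial(n, 0.5) draw
    k = bin(rng.getrandbits(n)).count("1") if n else 0
    result.update(rng.sample(only_one, k))
    return result

def inclusion_rates(a, b, trials=20_000, seed=0):
    """Fraction of trials in which each element of a | b was selected."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        counts.update(pick_binomial(a, b, rng))
    return {x: counts[x] / trials for x in sorted(a | b)}

rates = inclusion_rates({1, 2, 3, 4, 5, 6}, {4, 5, 6, 7, 8, 9})
for value, rate in rates.items():
    print(value, round(rate, 3))
```

Elements in both sets should always appear, and each element in only one set should appear in roughly half of the trials.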

---- Some speed testing of solutions. ----

4 variants:

# original
def f1(a, b):
    c = a & b
    c_temp = list(a ^ b)

    for x in range(len(c_temp)):
        if random.random() < 0.5:
            c.add(c_temp[x])
    return c


def f2(a, b):
    c = a & b
    xor_stuff = a^b
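    # list() added so this runs on Python 3.11+; the timings below were taken with the set passed directly
    xor_selected = random.sample(list(xor_stuff), np.random.binomial(len(xor_stuff), p=0.5))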
    c.update(xor_selected)
    return c

def f3(a, b):
    c = a & b
    st = list(a ^ b)
    c.update(np.array(st)[np.random.random(len(st)) > 0.5])
    return c

def f4(a, b):
    c = a & b

    for x in a ^ b:
        if random.random() < 0.5:
            c.add(x)
    return c

test_size = 1000
a2 = {random.randint(0, 10_000_000) for t in range(test_size)}
b2 = {random.randint(0, 10_000_000) for t in range(test_size)}

Results...

(Sadly, mine is the slowest, which surprised me. :( )

In [25]: %timeit f1(a2, b2)                                                     
391 µs ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [26]: %timeit f2(a2, b2)                                                     
644 µs ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [27]: %timeit f3(a2, b2)                                                     
365 µs ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [28]: %timeit f4(a2, b2)                                                     
342 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
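
The %timeit lines above need IPython. To repeat the comparison in a plain script, something like the following sketch with the standard timeit module should work. Only the two pure-Python loops are included so that it runs without NumPy; loop_indexed and loop_direct mirror f1 and f4, while make_sets and best_time are hypothetical helpers. Absolute numbers will differ by machine and Python version.

```python
import random
import timeit

def loop_indexed(a, b):
    """Same approach as f1: copy the xor set to a list and index into it."""
    c = a & b
    c_temp = list(a ^ b)
    for i in range(len(c_temp)):
        if random.random() < 0.5:
            c.add(c_temp[i])
    return c

def loop_direct(a, b):
    """Same approach as f4: iterate over the xor set directly."""
    c = a & b
    for x in a ^ b:
        if random.random() < 0.5:
            c.add(x)
    return c

def make_sets(size=1000, upper=10_000_000, seed=0):
    """Build two random integer sets like a2 and b2 in the test above."""
    rng = random.Random(seed)
    a = {rng.randint(0, upper) for _ in range(size)}
    b = {rng.randint(0, upper) for _ in range(size)}
    return a, b

def best_time(func, a, b, number=100, repeat=5):
    """Best observed seconds per call over `repeat` runs of `number` calls."""
    timer = timeit.Timer(lambda: func(a, b))
    return min(timer.repeat(repeat=repeat, number=number)) / number

a2, b2 = make_sets()
for func in (loop_indexed, loop_direct):
    print(f"{func.__name__}: {best_time(func, a2, b2) * 1e6:.0f} us per call")
```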

Answer 2

The list is unnecessary, and range-len iteration is slower than direct iteration. You can iterate over a ^ b directly:

def function(a, b):
    c = a & b

    for x in a ^ b:
        if random.random() < 0.5:
            c.add(x)
    return c
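
The same direct iteration can also be written as a set comprehension. This is only a sketch of an equivalent form; it was not timed in this thread, so no speed claim is made for it, and the name function_comprehension is hypothetical.

```python
import random

def function_comprehension(a, b):
    # Keep everything in both sets, plus each xor element with probability 1/2
    return (a & b) | {x for x in a ^ b if random.random() < 0.5}

print(function_comprehension({1, 2, 3, 4}, {1, 2, 5}))
```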

Answer 3

I think generating a uniform continuous random variable for a binary choice is a bit wasteful. So here is a suggestion using random.getrandbits:

import random
import itertools

def pp(a,b):
    out = a&b
    ab = a^b
    if ab:
        bitfield = map("1".__eq__,reversed(bin(random.getrandbits(len(ab)))))
        out.update(itertools.compress(ab,bitfield))
    return out

Alternatively, and perhaps clearer:

        bitfield = map("1".__eq__,f"{random.getrandbits(len(ab)):0{len(ab)}b}")
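
To see how the bit string drives itertools.compress, here is a small deterministic sketch that uses a fixed integer in place of random.getrandbits. The helper select_by_bits is hypothetical; it reverses the zero-padded string so that position 0 corresponds to the lowest bit, which makes no difference when the bits are random.

```python
import itertools

def select_by_bits(items, bits):
    """Keep items[i] when bit i (least significant first) of `bits` is set."""
    # Zero-padded binary string, reversed so index 0 is the lowest bit
    flags = f"{bits:0{len(items)}b}"[::-1]
    # "1".__eq__ turns each character into True/False for compress
    return list(itertools.compress(items, map("1".__eq__, flags)))

print(select_by_bits(["a", "b", "c", "d"], 0b0101))  # ['a', 'c']
```

With random.getrandbits(len(items)) as the bits argument, every subset of items is equally likely, which is the same as keeping each item independently with probability 1/2.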

Answer 1 (2) 2020-07-07 20:54:08
Answer 2 (1, accepted) 2020-07-07 21:17:50
Answer 3 (1) 2020-07-07 23:23:18
