Python: create a new list using a "template list"

Question

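Suppose I have:

x1 = [1, 3, 2, 4]

and:

x2 = [0, 1, 1, 0]

with the same shape.

Now I want to "lay x2 over x1" and, for each distinct value in x2, sum all the elements of x1 at the positions where x2 has that value.

So the end result is:

end = [1+4, 3+2]  # end[0] is the sum of all numbers of x1 where a 0 was in x2

Here is a naive implementation using lists to further clarify the question: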

store_0 = 0  # sum of x1 where x2 == 0
store_1 = 0  # sum of x1 where x2 == 1
x1 = [1, 3, 2, 4]
x2 = [0, 1, 1, 0]
for value_x1, value_x2 in zip(x1, x2):
    if value_x2 == 0:
        store_0 += value_x1
    elif value_x2 == 1:
        store_1 += value_x1

So my question is: is there a way to do this in numpy without a loop, or more generally, a faster way?

Answer 1

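In this particular example (and, in general, for unique, duplicated and groupby kinds of operations), pandas is faster than a pure numpy solution:

The pandas way, using a Series (credit: very similar to @mcsoini's answer):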

import numpy as np
import pandas as pd

def pd_group_sum(x1, x2):
    return pd.Series(x1, index=x2).groupby(x2).sum()

A pure numpy way, using np.unique and some fancy indexing:

def np_group_sum(a, groups):
    _, ix, rix = np.unique(groups, return_index=True, return_inverse=True)
    return np.where(np.arange(len(ix))[:, None] == rix, a, 0).sum(axis=1)
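To make the broadcasting step in np_group_sum easier to follow, here is a minimal sketch that builds the same group-membership mask explicitly and sums with a matrix product in place of np.where. The name np_group_sum_mask is made up for this illustration, and the sketch assumes 1-D inputs of equal length. Like the version above, it allocates an array of shape (number of groups, n), so it is meant for understanding rather than speed.

```python
import numpy as np

def np_group_sum_mask(a, groups):
    """Hypothetical helper: per-group sums via an explicit membership mask."""
    a = np.asarray(a)
    uniq, rix = np.unique(groups, return_inverse=True)
    # mask[k, j] is True when element j belongs to the k-th unique group
    mask = np.arange(len(uniq))[:, None] == rix
    # each mask row selects one group's elements; the product adds them up
    return mask @ a

x1 = np.array([1, 3, 2, 4, 8])
x2 = np.array([0, 1, 1, 0, 7])
sums = np_group_sum_mask(x1, x2)  # one sum per unique value of x2, in sorted order
```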

Note: a better pure numpy way, inspired by @Woodford's answer:

def selsum(a, g, e):
    return a[g==e].sum()

vselsum = np.vectorize(selsum, signature='(n),(n),()->()')

def np_group_sum2(a, groups):
    return vselsum(a, groups, np.unique(groups))

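Another pure numpy way, inspired by @mapf's comment about using argsort(). argsort alone already takes 45 ms, but we can try something based on np.argpartition(x2, len(x2)-1) instead, since by itself it takes only 7.5 ms in the benchmark below. Caveat: np.argpartition only guarantees the position of the k-th element and leaves the order inside each partition undefined, so this variant gives correct sums only when the returned indices happen to place equal group values next to each other; use np.argsort(groups) in its place when that must be guaranteed:

def np_group_sum3(a, groups):
    # indices intended to bring equal group values together (see caveat above)
    ix = np.argpartition(groups, len(groups)-1)
    # last position of each run of equal values in groups[ix]
    ends = np.nonzero(np.diff(np.r_[groups[ix], groups.max() + 1]))[0]
    # differences of the cumulative sum at the run ends give the per-run totals
    return np.diff(np.r_[0, a[ix].cumsum()[ends]])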
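Since np.argpartition is only documented to put the k-th element in its sorted position, here is a sketch of the same sort-then-accumulate idea built on np.argsort, which does guarantee that equal labels end up adjacent. The name np_group_sum_sorted is chosen for this illustration, the sketch assumes non-empty 1-D arrays, and no timing is claimed for it; expect it to cost at least as much as the argsort call itself.

```python
import numpy as np

def np_group_sum_sorted(a, groups):
    """Hypothetical helper: per-group sums via a full sort of the labels."""
    a = np.asarray(a)
    groups = np.asarray(groups)
    order = np.argsort(groups, kind="stable")
    sorted_groups = groups[order]
    # index where each run of equal labels starts in the sorted order
    starts = np.flatnonzero(np.r_[True, sorted_groups[1:] != sorted_groups[:-1]])
    # np.add.reduceat sums a[order] between consecutive start indices
    return sorted_groups[starts], np.add.reduceat(a[order], starts)

x1 = np.array([1, 3, 2, 4, 8])
x2 = np.array([0, 1, 1, 0, 7])
labels, sums = np_group_sum_sorted(x1, x2)
```

Returning the labels alongside the sums makes it clear which total belongs to which value of x2.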

Example (slightly modified)

x1 = np.array([1, 3, 2, 4, 8])  # I added a group for sake of generality
x2 = np.array([0, 1, 1, 0, 7])

>>> pd_group_sum(x1, x2)
0    5
1    5
7    8

>>> np_group_sum(x1, x2)  # and all the np_group_sum() variants
array([5, 5, 8])

Speed

n = 1_000_000
x1 = np.random.randint(0, 20, n)
x2 = np.random.randint(0, 20, n)

%timeit pd_group_sum(x1, x2)
# 13.9 ms ± 65.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np_group_sum(x1, x2)
# 171 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_group_sum2(x1, x2)
# 66.7 ms ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_group_sum3(x1, x2)
# 25.6 ms ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Going through pandas is faster, in part because of numpy issue 11136.

Answer 2

>>> x1 = np.array([1, 3, 2, 7])
>>> x2 = np.array([0, 1, 1, 0])
>>> for index in np.unique(x2):
...     print(f'{index}: {x1[x2==index].sum()}')
...
0: 8
1: 5
>>> # or in one line
>>> [(index, x1[x2==index].sum()) for index in np.unique(x2)]
[(0, 8), (1, 5)]

Answer 3

Would a pandas one-liner do?

store_0, store_1 = pd.DataFrame({"x1": x1, "x2": x2}).groupby("x2").x1.sum()

Or as a dictionary, for arbitrarily many values in x2:

pd.DataFrame({"x1": x1, "x2": x2}).groupby("x2").x1.sum().to_dict()

Output:

{0: 5, 1: 5}

Answer 4

Using itertools.compress:

from itertools import compress
# result[0] sums x1 where x2 is 0, result[1] sums x1 where x2 is 1
result = [sum(compress(x1, (not x for x in x2))), sum(compress(x1, x2))]
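compress treats its selector as true/false, so the expression above only covers a binary x2. As a sketch of how the same pure-Python idea might extend to arbitrary labels, the snippet below accumulates into a collections.defaultdict; the function name group_sums is an assumption made for this example.

```python
from collections import defaultdict

def group_sums(values, labels):
    """Hypothetical helper: sum values per label without numpy."""
    totals = defaultdict(int)
    for value, label in zip(values, labels):
        totals[label] += value
    return dict(totals)

x1 = [1, 3, 2, 4]
x2 = [0, 1, 1, 0]
result = group_sums(x1, x2)
```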

Answer 5

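This extends your loop to more values. I can't think of a numpy one-liner that does this.

# assumes every value in x2 is an integer in range(10000)
sums = [0] * 10000
for vx1, vx2 in zip(x1, x2):
    sums[vx2] += vx1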
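For what it is worth, NumPy does ship vectorized counterparts to this accumulation loop: np.bincount with its weights argument, and np.add.at. The sketch below assumes x2 holds non-negative integers; the variable names sums_float and sums_int are illustrative only.

```python
import numpy as np

x1 = np.array([1, 3, 2, 4, 8])
x2 = np.array([0, 1, 1, 0, 7])

# np.bincount adds up the weights that fall into each integer bin;
# the result is float64 with one slot per value from 0 to x2.max()
sums_float = np.bincount(x2, weights=x1)

# np.add.at does the same unbuffered accumulation and keeps the integer dtype
sums_int = np.zeros(x2.max() + 1, dtype=x1.dtype)
np.add.at(sums_int, x2, x1)
```

Both results are indexed by the label itself, so labels that never occur (2 through 6 here) simply hold 0.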

Answer 6

By converting the second list to a Boolean array, you can use it to index the first:

import numpy as np

x1 = np.array([1, 3, 2, 4])
x2 = np.array([0, 1, 1, 0], dtype=bool)

end = [np.sum(x1[~x2]), np.sum(x1[x2])]
end

[5, 5]

Edit: if the values in x2 can be greater than 1, you can use a list comprehension:

x1 = np.array([1, 3, 2, 4])
x2 = np.array([0, 1, 1, 0])

end = [np.sum(x1[x2 == i]) for i in range(max(x2) + 1)]

Answer 7

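This extends the solution Tim Roberts suggested at the start, but accounts for x2 having more than two values, i.e. being non-binary. Here the values are strictly contiguous, because the for loop iterates over range(rng), but it could be extended so that x2 has non-contiguous values, e.g. [0 2 2 2 1 4] (no 3s), whereas the randint used in this example returns a vector like [0 1 1 3 4 2].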

import numpy as np
rng = 5  # number of distinct values in x2, i.e. [0 1 2 3 4]
x1 = np.random.randint(20, size=10000)  # random vector of size 10k
x2 = np.random.randint(rng, size=10000)  # indexing vector of size 10k with values 0-4

store = []
for i in range(rng):  # sum x1 over each value of x2 and append to the list
    store.append(x1[x2 == i].sum())
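One way the extension to non-contiguous values could look is to loop over the labels that actually occur, obtained from np.unique, and key the results by label. This is only a sketch; the sample data is made up to match the [0 2 2 2 1 4] pattern mentioned above.

```python
import numpy as np

x1 = np.array([5, 1, 2, 3, 4, 6])
x2 = np.array([0, 2, 2, 2, 1, 4])  # no 3s

# iterate over the labels present in x2 rather than over range(rng)
labels = np.unique(x2)
store = {int(label): int(x1[x2 == label].sum()) for label in labels}
```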

7 answers

Answer 1
7 Accepted 2021-04-26 19:37:41

Answer 2
4 2021-04-26 19:03:06

Answer 3
3 2021-04-26 19:04:00

Answer 4
2 2021-04-26 19:24:25

Answer 5
1 2021-04-26 19:02:38

Answer 6
1 2021-04-26 19:08:50

Answer 7
1 2021-04-26 19:27:23
