Pandas - Sum values in a list based on indexes in another list

Question

I am trying to find the most Pythonic way to solve my problem in as little time as possible, since I am dealing with a large amount of data. My problem is the following:

I have two lists:

a = [12,34,674,2,0,5,6,8]
b = ['foo','bar','bar','foo','foo','bar','foo','foo']

I want to tell Python: wherever 'bar' appears in b, take those indexes and sum the values in list a at those indexes.

This is what I have done so far:

# enumerate b (the labels), not a, so that j can equal 'bar'
idx = [i for i, j in enumerate(b) if j == 'bar']

but then I am stuck. I am considering using some weird for loops. Do you have any ideas?

Answer 1

With numpy:

import numpy as np

a = np.array(a)
b = np.array(b)

a[b == 'bar'].sum()
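
For reference, here is a minimal self-contained sketch of the same masking approach, using the lists from the question. The names `a_arr`, `b_arr`, `mask`, and `bar_sum` are illustrative and do not appear in the original answer.

```python
import numpy as np

a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']

a_arr = np.array(a)
b_arr = np.array(b)

# Elementwise comparison yields a boolean array, True where b is 'bar'
mask = b_arr == 'bar'

# Boolean indexing keeps only the values of a at the True positions
bar_sum = a_arr[mask].sum()
print(bar_sum)  # 713
```

Keeping the mask in its own variable makes it easy to reuse, for example `a_arr[~mask].sum()` for everything that is not 'bar'.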

Answer 2

Using np.bincount. This computes both sums ('foo' and 'bar').

sum_foo, sum_bar = np.bincount(np.char.equal(b, 'bar'), a)
sum_foo
# 28.0
sum_bar
# 713.0

Note that np.char.equal works on both lists and arrays. If b is an array, then b == 'bar' can be used instead and is a bit faster.

Timings:

Even though this computes both sums, it is actually pretty fast:
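
If there are more than two labels, the same weighted-bincount idea can be extended by first mapping each label to an integer code. The sketch below is one possible way to do that with `np.unique(..., return_inverse=True)`; this step is an assumption added for illustration and is not part of the original answer, and the names `labels`, `codes`, `sums`, and `result` are hypothetical.

```python
import numpy as np

a = np.array([12, 34, 674, 2, 0, 5, 6, 8])
b = np.array(['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo'])

# Map each label to an integer code; the unique labels come back sorted
labels, codes = np.unique(b, return_inverse=True)

# With weights, bincount adds a[i] into the bin numbered codes[i]
sums = np.bincount(codes, weights=a)

result = dict(zip(labels.tolist(), sums.tolist()))
print(result)  # {'bar': 713.0, 'foo': 28.0}
```

Because weights are accumulated as floats, the sums come back as floating-point values even when the inputs are integers.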

from timeit import timeit

timeit(lambda: np.bincount(b == 'bar', a))
# 2.406161994993454

Compare, for example, with the numpy masking method:

timeit(lambda: a[b == 'bar'].sum())
# 5.642918559984537

On larger arrays masking becomes slightly faster, which is expected since bincount does essentially 2x the work. Still, bincount takes less than 2x the time, so if you happen to need both sums ('foo' and 'bar'), bincount is still faster.

aa = np.repeat(a, 1000)
bb = np.repeat(b, 1000)
timeit(lambda: aa[bb == 'bar'].sum(), number=1000)
# 0.07860603698645718
timeit(lambda: np.bincount(bb == 'bar', aa), number=1000)
# 0.11229897901648656
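
To reproduce a comparison like this on your own machine, a small harness along the following lines may help. It is only a sketch: the helper names `mask_sum` and `bincount_sums` and the choice of `number=200` are assumptions, and the timings will differ from those quoted above. Note that `timeit` defaults to 1,000,000 calls when `number` is omitted, so the totals reported for the small arrays are not directly comparable with the `number=1000` runs.

```python
from timeit import timeit

import numpy as np

a = np.array([12, 34, 674, 2, 0, 5, 6, 8])
b = np.array(['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo'])

# Enlarge the inputs the same way the answer does
aa = np.repeat(a, 1000)
bb = np.repeat(b, 1000)

def mask_sum():
    # Sum of the values whose label is 'bar'
    return aa[bb == 'bar'].sum()

def bincount_sums():
    # Index 0 holds the non-'bar' total, index 1 the 'bar' total
    return np.bincount(bb == 'bar', aa)

# Check that both approaches agree before timing them
assert mask_sum() == bincount_sums()[1]

t_mask = timeit(mask_sum, number=200)
t_bincount = timeit(bincount_sums, number=200)
print(f"masking:  {t_mask:.4f} s for 200 calls")
print(f"bincount: {t_bincount:.4f} s for 200 calls")
```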

Answer 3

This is simple to do in pandas:

In [5]:
import pandas as pd
a = [12,34,674,2,0,5,6,8]
b = ['foo','bar','bar','foo','foo','bar','foo','foo']
df = pd.DataFrame({'a':a, 'b':b})
df

Out[5]: 
     a    b
0   12  foo
1   34  bar
2  674  bar
3    2  foo
4    0  foo
5    5  bar
6    6  foo
7    8  foo

In [8]: df.loc[df['b']=='bar','a'].sum()
Out[8]: 713

So here we take your lists and construct a dict in place as the data argument for the DataFrame constructor:

df = pd.DataFrame({'a':a, 'b':b})

Then we mask the df using loc: we select the rows where 'b' == 'bar', select the column 'a', and call sum():

df.loc[df['b']=='bar','a'].sum()
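
If the totals for every label are needed rather than just 'bar', one alternative (not used in the original answer) is a groupby. The following is a minimal sketch; the name `totals` is illustrative.

```python
import pandas as pd

a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']
df = pd.DataFrame({'a': a, 'b': b})

# Group the rows by label and sum column 'a' within each group
totals = df.groupby('b')['a'].sum()

print(totals['bar'])  # 713
print(totals['foo'])  # 28
```

The result is a Series indexed by label, so individual totals can be looked up by name.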

Answer 4

Use a list comprehension to collect the matching values (pass the result to sum() for the total):

l = [x for x,y in zip(a,b) if y == 'bar']

If you want indexes:

l = [i for (i,x),y in zip(enumerate(a),b) if y == 'bar']
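
To get from these lists to the sum the question asks for, a generator expression passed to `sum()` is probably enough. This is a sketch in plain Python; `bar_sum` and `bar_idx` are illustrative names, and the index lookup is simplified to iterate over b alone.

```python
a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']

# Sum directly with a generator expression, without building a list first
bar_sum = sum(x for x, y in zip(a, b) if y == 'bar')
print(bar_sum)  # 713

# Indexes of the matching positions, taken from b alone
bar_idx = [i for i, y in enumerate(b) if y == 'bar']
print(bar_idx)  # [1, 2, 5]
```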

4 solutions

Solution 1
4 Accepted 2019-03-20 09:26:08

Solution 2
3 2019-03-20 09:33:21

Solution 3
0 2019-03-20 09:24:17

Solution 4
0 2019-03-20 09:25:02