
Pandas - Sum values in list according to index from another list

I am trying to find the most Pythonic way to solve my problem in as little time as possible, since I am dealing with a large amount of data. My problem is the following:

I have two lists:

a = [12,34,674,2,0,5,6,8]
b = ['foo','bar','bar','foo','foo','bar','foo','foo']

I want to tell Python: wherever 'bar' appears in b, take those indexes and sum the values in list a at those indexes.

This is what I have done so far:

idx = [i for i, j in enumerate(b) if j == 'bar']

but then I am stuck. I am considering using some weird for loops. Do you have any ideas?

With numpy:

import numpy as np

a = np.array(a)
b = np.array(b)

a[b == 'bar'].sum()
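As a self-contained sketch of the masking approach above, the snippet below spells out the intermediate steps. The names mask, bar_values and bar_sum are illustrative only and do not appear in the original answer.

```python
import numpy as np

a = np.array([12, 34, 674, 2, 0, 5, 6, 8])
b = np.array(['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo'])

# Element-wise comparison gives a boolean array, True where b equals 'bar'
mask = b == 'bar'

# Boolean indexing keeps only the values of a at the True positions
bar_values = a[mask]

# Sum of the selected values: 34 + 674 + 5
bar_sum = bar_values.sum()
print(bar_sum)  # 713
```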

Using np.bincount. This computes both sums ('foo' and 'bar').

sum_foo, sum_bar = np.bincount(np.char.equal(b, 'bar'), a)
sum_foo
# 28.0
sum_bar
# 713.0

Note that np.char.equal works on both lists and arrays. If b is an array, then b == 'bar' can be used instead and is a bit faster.

Timings:

Even though this computes both sums, it is actually pretty fast:
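If b could contain more than two distinct labels, one possible extension of the bincount idea (not part of the original answer) is to map each label to an integer code with np.unique and use those codes as bins. The names labels, codes and sums_by_label are assumptions made for this sketch.

```python
import numpy as np

a = np.array([12, 34, 674, 2, 0, 5, 6, 8])
b = np.array(['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo'])

# labels: sorted unique values of b; codes: index into labels for each element of b
labels, codes = np.unique(b, return_inverse=True)

# bincount adds up the weights (values of a) that fall into each integer code
sums = np.bincount(codes, weights=a)

# Pair each label with its sum; bincount with weights returns floats
sums_by_label = dict(zip(labels.tolist(), sums.tolist()))
print(sums_by_label)  # {'bar': 713.0, 'foo': 28.0}
```

The weighted bincount always returns floating-point sums, so cast back to int if integer results are required.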

from timeit import timeit

timeit(lambda: np.bincount(b == 'bar', a))
# 2.406161994993454

Compare, for example, with the numpy masking method:

timeit(lambda: a[b == 'bar'].sum())
# 5.642918559984537

On larger arrays masking becomes slightly faster, which is expected since bincount does essentially 2x the work. Still, bincount takes less than 2x the time, so if you happen to need both sums ('foo' and 'bar'), bincount is still faster.

aa = np.repeat(a, 1000)
bb = np.repeat(b, 1000)
timeit(lambda: aa[bb == 'bar'].sum(), number=1000)
# 0.07860603698645718
timeit(lambda: np.bincount(bb == 'bar', aa), number=1000)
# 0.11229897901648656
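To reproduce this comparison on your own machine, a self-contained script along the following lines should work. The helper names mask_sum and bincount_sums and the choice of number=100 are assumptions for this sketch, and the measured times will differ from the figures quoted above.

```python
from timeit import timeit

import numpy as np

a = np.array([12, 34, 674, 2, 0, 5, 6, 8])
b = np.array(['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo'])

# Repeat every element 1000 times to get 8000-element arrays
aa = np.repeat(a, 1000)
bb = np.repeat(b, 1000)

def mask_sum():
    # Boolean masking: only the 'bar' sum
    return aa[bb == 'bar'].sum()

def bincount_sums():
    # Weighted bincount: index 0 holds the non-'bar' sum, index 1 the 'bar' sum
    return np.bincount(bb == 'bar', weights=aa)

t_mask = timeit(mask_sum, number=100)
t_bincount = timeit(bincount_sums, number=100)
print(f"masking:  {t_mask:.4f} s")
print(f"bincount: {t_bincount:.4f} s")
```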

This is simple to do in pandas:

In [5]:
import pandas as pd
a = [12,34,674,2,0,5,6,8]
b = ['foo','bar','bar','foo','foo','bar','foo','foo']
df = pd.DataFrame({'a':a, 'b':b})
df

Out[5]: 
     a    b
0   12  foo
1   34  bar
2  674  bar
3    2  foo
4    0  foo
5    5  bar
6    6  foo
7    8  foo

In [8]: df.loc[df['b']=='bar','a'].sum()
Out[8]: 713

So here we take your lists, build a dict inline, and pass it as the data argument to the DataFrame constructor:

df = pd.DataFrame({'a':a, 'b':b})

Then we just mask the df using loc: we select the rows where 'b' == 'bar', select the column 'a', and call sum():

df.loc[df['b']=='bar','a'].sum()
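If the sums for every label in b are needed rather than just 'bar', one option (an addition to this answer, not something it proposes) is a groupby on column 'b'. The variable name sums is illustrative.

```python
import pandas as pd

a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']
df = pd.DataFrame({'a': a, 'b': b})

# Group the rows by the label in 'b' and sum column 'a' within each group
sums = df.groupby('b')['a'].sum()

print(sums['bar'])  # 713
print(sums['foo'])  # 28
```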

Use:

l = [x for x,y in zip(a,b) if y == 'bar']

If you want indexes:

l = [i for (i,x),y in zip(enumerate(a),b) if y == 'bar']
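The comprehensions above collect the matching values or indexes but stop short of the sum the question asks for. A minimal pure-Python sketch that finishes the job, assuming the same lists a and b, could look like this (bar_sum and totals are illustrative names):

```python
from collections import defaultdict

a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']

# Sum the values of a whose paired label in b is 'bar', without building a list
bar_sum = sum(x for x, y in zip(a, b) if y == 'bar')
print(bar_sum)  # 713

# One pass over both lists that accumulates a total per label
totals = defaultdict(int)
for x, y in zip(a, b):
    totals[y] += x
print(dict(totals))  # {'foo': 28, 'bar': 713}
```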
