Pandas - Sum values in list according to index from another list
I am trying to find the most Pythonic way to solve my problem in as little time as possible, since I am dealing with a large amount of data. My problem is the following:
I have two lists:
a = [12,34,674,2,0,5,6,8]
b = ['foo','bar','bar','foo','foo','bar','foo','foo']
I want to tell Python: wherever 'bar' appears in b, take those indexes and sum the values in list a at those indexes.
This is what I have done so far:
idx = [i for i, j in enumerate(b) if j == 'bar']
but then I am stuck. I am considering using some weird for loops. Do you have any ideas?
With numpy:
import numpy as np
a = np.array(a)
b = np.array(b)
a[b == 'bar'].sum()
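To make the masking step explicit, here is a minimal self-contained sketch using the lists from the question. The intermediate names `mask`, `selected`, `total` and `idx` are introduced only for illustration.

```python
import numpy as np

a = np.array([12, 34, 674, 2, 0, 5, 6, 8])
b = np.array(['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo'])

# Elementwise comparison gives a boolean array, True where b equals 'bar'.
mask = b == 'bar'

# Indexing a with the mask keeps only the values at the True positions.
selected = a[mask]          # the values 34, 674 and 5
total = selected.sum()      # 713

# np.flatnonzero(mask) recovers the matching positions, if they are needed.
idx = np.flatnonzero(mask)  # positions 1, 2 and 5
```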
Using np.bincount. This computes both sums ('foo' and 'bar').
sum_foo, sum_bar = np.bincount(np.char.equal(b, 'bar'), a)
sum_foo
# 28.0
sum_bar
# 713.0
Note that np.char.equal works on both lists and arrays. If b is an array, then b == 'bar' can be used instead and is a bit faster.
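As a sketch of why this works: `np.bincount(x, weights)` adds each weight into the bin numbered by the matching entry of `x`, and a boolean mask counts as 0 for `False` and 1 for `True`. The example below (the names `mask_from_list` and `sums` are illustrative) also shows why a plain list needs `np.char.equal`: comparing a list to a string with `==` gives a single `False` rather than an elementwise result.

```python
import numpy as np

a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']

# A plain list compared to a string is a single bool, not an elementwise mask.
assert (b == 'bar') is False

# np.char.equal converts the list and compares element by element.
mask_from_list = np.char.equal(b, 'bar')

# Bin 0 collects the weights where the mask is False ('foo'),
# bin 1 collects the weights where it is True ('bar').
sums = np.bincount(mask_from_list, weights=a)
sum_foo, sum_bar = sums  # 28.0 and 713.0
```

Note that unpacking into two names assumes at least one 'bar' is present; if there is none, `np.bincount` returns a single bin.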
Timings:
Even though this computes both sums, it is actually pretty fast:
from timeit import timeit
timeit(lambda: np.bincount(b == 'bar', a))
# 2.406161994993454
Compare, for example, with the numpy masking method:
timeit(lambda: a[b == 'bar'].sum())
# 5.642918559984537
On larger arrays, masking becomes slightly faster, which is expected since bincount does essentially 2x the work. Still, bincount takes less than 2x the time, so if you happen to need both sums ('foo' and 'bar'), bincount is still faster.
aa = np.repeat(a, 1000)
bb = np.repeat(b, 1000)
timeit(lambda: aa[bb == 'bar'].sum(), number=1000)
# 0.07860603698645718
timeit(lambda: np.bincount(bb == 'bar', aa), number=1000)
# 0.11229897901648656
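The case where both sums are needed can be timed with a small script like the one below. This is a sketch only: absolute timings depend on the machine and the NumPy version, so no expected numbers are shown, and the helper names `both_sums_masking` and `both_sums_bincount` are made up for this illustration.

```python
from timeit import timeit

import numpy as np

a = np.array([12, 34, 674, 2, 0, 5, 6, 8])
b = np.array(['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo'])
aa = np.repeat(a, 1000)
bb = np.repeat(b, 1000)

def both_sums_masking(values, labels):
    # Two selections: the 'bar' mask and its negation.
    mask = labels == 'bar'
    return values[~mask].sum(), values[mask].sum()

def both_sums_bincount(values, labels):
    # One call: bin 0 holds the 'foo' sum, bin 1 the 'bar' sum.
    return tuple(np.bincount(labels == 'bar', weights=values))

t_mask = timeit(lambda: both_sums_masking(aa, bb), number=1000)
t_bincount = timeit(lambda: both_sums_bincount(aa, bb), number=1000)
print(f"masking:  {t_mask:.4f} s")
print(f"bincount: {t_bincount:.4f} s")
```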
This is simple to do in pandas:
In [5]:
import pandas as pd
a = [12,34,674,2,0,5,6,8]
b = ['foo','bar','bar','foo','foo','bar','foo','foo']
df = pd.DataFrame({'a':a, 'b':b})
df
Out[5]:
     a    b
0   12  foo
1   34  bar
2  674  bar
3    2  foo
4    0  foo
5    5  bar
6    6  foo
7    8  foo
In [8]: df.loc[df['b']=='bar','a'].sum()
Out[8]: 713
So here we take your lists and construct a dict in place as the data argument for the DataFrame constructor:
df = pd.DataFrame({'a':a, 'b':b})
Then we just mask the df using loc: we select the rows where 'b' == 'bar', select the column 'a', and call sum():
df.loc[df['b']=='bar','a'].sum()
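If the sums for every label are needed at once, a `groupby` is another option. This is a variation on the approach above rather than part of the original answer, sketched with the same data; the names `sums`, `bar_total` and `foo_total` are illustrative.

```python
import pandas as pd

a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']
df = pd.DataFrame({'a': a, 'b': b})

# Group the rows by the label in column 'b' and sum column 'a' per group.
sums = df.groupby('b')['a'].sum()

bar_total = sums['bar']  # 713
foo_total = sums['foo']  # 28
```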
Use a list comprehension to collect the values of a whose label in b is 'bar':
l = [x for x,y in zip(a,b) if y == 'bar']
If you want indexes:
l = [i for (i,x),y in zip(enumerate(a),b) if y == 'bar']
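Neither comprehension produces the sum itself. A minimal way to finish the job in plain Python, assuming the same `a` and `b` as in the question, is to feed the filtered values to the built-in `sum`:

```python
a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']

# Pair each value with its label and keep only the values labelled 'bar'.
total = sum(x for x, y in zip(a, b) if y == 'bar')  # 713

# The matching indexes, taken from b directly.
idx = [i for i, y in enumerate(b) if y == 'bar']  # [1, 2, 5]
```

The generator expression avoids building an intermediate list, which can matter when the inputs are large.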