Vectorized implementation of a function in pandas
This is my current function:
def partnerTransaction(main_df, ptn_code, intent, retail_unique):
    if intent == 'Frequency':
        return main_df.query('csp_code == @retail_unique & partner_code == @ptn_code')['tx_amount'].count()
    elif intent == 'Total_value':
        return main_df.query('csp_code == @retail_unique & partner_code == @ptn_code')['tx_amount'].sum()
What it does is accept a pandas DataFrame (DF 1) and three search parameters. The retail_unique is a string that comes from another DataFrame (DF 2). Currently, I iterate over the rows of DF 2 using itertuples, call around 200 such functions, and write the results to a 3rd DF; this is just an example. I have around 16000 rows in DF 2, so it's very slow. What I want to do is vectorize this function. I want it to return a pandas Series with the count of tx_amount per retail unique. So the Series would be:
34 # retail a
54 # retail b
23 # retail c
I would then map this Series to the 3rd DF. Any ideas on how I might approach this?
EDIT: The first DF contains time-based data, with each retail appearing multiple times in one column and the tx_amount in another column, like so:
Retail tx_amount
retail_a 50
retail_b 100
retail_a 70
retail_c 20
retail_a 10
The second DF is arranged per retailer:
Retail
retail_a
retail_b
retail_c
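For context, the row-by-row pattern described above can be sketched like this (a simplified reconstruction using the sample columns; the real function also filters on partner_code, which is omitted here):

```python
import pandas as pd

# Sample frames matching the question's EDIT.
df1 = pd.DataFrame({
    'Retail': ['retail_a', 'retail_b', 'retail_a', 'retail_c', 'retail_a'],
    'tx_amount': [50, 100, 70, 20, 10],
})
df2 = pd.DataFrame({'Retail': ['retail_a', 'retail_b', 'retail_c']})

def partnerTransaction(main_df, intent, retail_unique):
    # Simplified lookup: scans all of main_df for one retailer.
    sub = main_df.loc[main_df['Retail'] == retail_unique, 'tx_amount']
    return sub.count() if intent == 'Frequency' else sub.sum()

# One full scan of df1 per row of df2 -> O(len(df1) * len(df2)),
# which is why 16000 rows in df2 make this slow.
freq = [int(partnerTransaction(df1, 'Frequency', r)) for r in df2['Retail']]
print(freq)  # [3, 1, 1]
```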
First use merge with a left join. Then groupby by the column Retail and aggregate the tx_amount column with the agg functions size and sum, either together or, in the second solution, separately. Last, use reset_index to convert the resulting Series to a 2-column DataFrame:
If you need both outputs together:
def partnerTransaction_together(df1, df2):
    df = pd.merge(df1, df2, on='Retail', how='left')
    d = {'size': 'Frequency', 'sum': 'Total_value'}
    return df.groupby('Retail')['tx_amount'].agg(['size','sum']).rename(columns=d)
print (partnerTransaction_together(df1, df2))
Frequency Total_value
Retail
retail_a 3 130
retail_b 1 100
retail_c 1 20
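To attach these aggregates back to a per-retailer frame (the "3rd DF" mentioned in the question), the grouped result can be joined on the Retail column; a minimal sketch using the sample data (the frame name df3 is an assumption):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Retail': ['retail_a', 'retail_b', 'retail_a', 'retail_c', 'retail_a'],
    'tx_amount': [50, 100, 70, 20, 10],
})
df2 = pd.DataFrame({'Retail': ['retail_a', 'retail_b', 'retail_c']})

# Aggregate once, then join on Retail instead of one lookup per row.
agg = (df1.groupby('Retail')['tx_amount']
          .agg(['size', 'sum'])
          .rename(columns={'size': 'Frequency', 'sum': 'Total_value'}))
df3 = df2.join(agg, on='Retail')
print(df3)
```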
But if you need to use conditions:
def partnerTransaction(df1, df2, intent):
    df = pd.merge(df1, df2, on='Retail', how='left')
    g = df.groupby('Retail')['tx_amount']
    if intent == 'Frequency':
        return g.size().reset_index(name='Frequency')
    elif intent == 'Total_value':
        return g.sum().reset_index(name='Total_value')
print (partnerTransaction(df1, df2, 'Frequency'))
Retail Frequency
0 retail_a 3
1 retail_b 1
2 retail_c 1
print (partnerTransaction(df1, df2, 'Total_value'))
Retail Total_value
0 retail_a 130
1 retail_b 100
2 retail_c 20
If you want speed, here is a numpy solution using bincount:
import numpy as np
import pandas as pd
from collections import OrderedDict

f, u = pd.factorize(df1.Retail.values)
c = np.bincount(f)
s = np.bincount(f, df1.tx_amount.values).astype(df1.tx_amount.dtype)
pd.DataFrame(OrderedDict(Frequency=c, Total_value=s), u)
Frequency Total_value
retail_a 3 130
retail_b 1 100
retail_c 1 20
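What factorize and bincount are each doing can be seen on a tiny example (values taken from the sample data above):

```python
import numpy as np
import pandas as pd

retail = np.array(['retail_a', 'retail_b', 'retail_a', 'retail_c', 'retail_a'],
                  dtype=object)
amount = np.array([50, 100, 70, 20, 10])

# factorize maps each label to an integer code (f) and returns the
# unique labels in first-seen order (u).
f, u = pd.factorize(retail)
print(f)                       # [0 1 0 2 0]

# Plain bincount counts occurrences of each code; with weights it
# sums the weights per code instead, giving per-retailer totals.
print(np.bincount(f))          # [3 1 1]
print(np.bincount(f, amount))  # sums per code: 130, 100, 20 (as floats)
```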
Timing
df1 = pd.DataFrame(dict(
Retail=np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), 10000),
tx_amount=np.random.randint(1000, size=10000)
))
%%timeit
f, u = pd.factorize(df1.Retail.values)
c = np.bincount(f)
s = np.bincount(f, df1.tx_amount.values).astype(df1.tx_amount.dtype)
pd.DataFrame(OrderedDict(Frequency=c, Total_value=s), u)
1000 loops, best of 3: 607 µs per loop
%%timeit
d = {'size':'Frequency','sum':'Total_value'}
df1.groupby('Retail')['tx_amount'].agg(['size','sum']).rename(columns=d)
1000 loops, best of 3: 1.53 ms per loop