Speeding up Python performance
Adding 100k rows of raw data and the plot drawn by the suggested code. I have the code below, analyzing 100K rows of data, and it takes 3 minutes for the output to be shown. The problem is with the for loops, where the program needs to check two indicators and then act based upon that. The data is a bourse record of buy/sell/NA trades, and I want to plot buy vs. sell and so on.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('gco.csv', encoding ='utf-16 LE')
x = data.index.size
data['Money'] = data['Last']*data['Volume']
data['Date'] = data['Time']
# Creating date column
data['Date'] = data['Date'].map(lambda x: x[0:10])
# Creating a dedicated database
my_df = pd.DataFrame(columns =['Buy','Sell','NA'])
#calculate the Buy column
avai_dates = pd.unique(data.Date)
y = len(avai_dates)
my_df = pd.DataFrame(index=np.arange(0, y), columns =['Buy','Sell','NA'])
my_df[:]=0
for j in range(y):
    for i in range(x):
        if data.Date[i] == avai_dates[j] and data.Type[i] == 'Buy':
            my_df.Buy[j] += data.Money[i]
    for i in range(x):
        if data.Date[i] == avai_dates[j] and data.Type[i] == 'Sell':
            my_df.Sell[j] += data.Money[i]
    for i in range(x):
        if data.Date[i] == avai_dates[j] and data.Type[i] == 'Buy/Sell':
            my_df.NA[j] += data.Money[i]
new_df = my_df[(my_df.T != 0).any()]
z = len(new_df)
xm = np.arange(0, z)
plt.plot(xm, new_df.Buy, 'green')
plt.plot(xm, new_df.Sell, 'red')
plt.plot(xm, new_df.NA, 'yellow')
plt.xlabel('Dates', fontsize = 15)
plt.ylabel('Money Volumes', fontsize = 15)
plt.title('Buy vs. Sell Vs. NA')
plt.grid()
plt.show()
ax = plt.subplot(111)
ax.bar(xm-0.2, new_df.Buy, width = 0.2, color = 'g')
ax.bar(xm,new_df.Sell, width = 0.2 , color = 'r')
ax.bar(xm+0.2,new_df.NA, width = 0.2, color = 'y')
What you are doing is grouping the data points by Time and Type and aggregating them. Pandas has built-in functions for doing this.
You can replace all this code:
# Creating a dedicated database
my_df = pd.DataFrame(columns =['Buy','Sell','NA'])
#calculate the Buy column
avai_dates = pd.unique(data.Date)
y = len(avai_dates)
my_df = pd.DataFrame(index=np.arange(0, y), columns =['Buy','Sell','NA'])
my_df[:]=0
for j in range(y):
    for i in range(x):
        if data.Date[i] == avai_dates[j] and data.Type[i] == 'Buy':
            my_df.Buy[j] += data.Money[i]
    for i in range(x):
        if data.Date[i] == avai_dates[j] and data.Type[i] == 'Sell':
            my_df.Sell[j] += data.Money[i]
    for i in range(x):
        if data.Date[i] == avai_dates[j] and data.Type[i] == 'Buy/Sell':
            my_df.NA[j] += data.Money[i]
new_df = my_df[(my_df.T != 0).any()]
With this statement:
new_df = data.groupby(["Time", "Type"]).agg({'Money':['sum']})["Money","sum"].unstack(fill_value=0)
data.groupby(["Time", "Type"])
This hierarchically groups the data by Time and then by Type. For more information on this, check out the DataFrame.groupby() documentation.
.agg({'Money':['sum']})
This aggregates the Money values in each group by summing them up. You could just use .agg('sum'), but that would also aggregate the values of 'Last' and 'Volume'.
["Money","sum"]
Then we just unpack the columns to get to the raw sum.然后我们只需解压缩列即可得到原始总和。 This gives you almost the result, however it has the Type
group stacked:这几乎为您提供了结果,但是它堆叠了Type
组:
Time Type
2020:12:12 Buy 1000
Sell 1000
2020:12:13 Buy 4400
Sell 2200
2020:12:14 Sell 4680
2020:12:15 Buy 2860
Sell 1430
2020:12:16 Buy/Sell 6400
2020:12:17 Buy 7140
2020:12:18 Buy/Sell 770
2020:12:19 Buy 810
Sell 1620
2020:12:20 Buy 2400
Sell 1200
2020:12:21 Buy 1210
2020:12:22 Buy 1200
Sell 1200
Name: (Money, sum), dtype: int64
.unstack(fill_value=0)
You can now use the final unstacking call to get the final dataframe. By setting fill_value=0 you ensure that the undefined values are set to 0 instead of NaN.
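Here is unstack with fill_value in isolation, on an assumed toy series with the same (Time, Type) MultiIndex shape as the stacked output above: the Type level pivots into columns, and missing (Time, Type) combinations become 0 rather than NaN.

```python
import pandas as pd

# Assumed toy values; (2020:12:13, Sell) is deliberately absent.
s = pd.Series(
    [1000, 1000, 4400],
    index=pd.MultiIndex.from_tuples(
        [("2020:12:12", "Buy"), ("2020:12:12", "Sell"), ("2020:12:13", "Buy")],
        names=["Time", "Type"],
    ),
)

# Innermost level (Type) becomes columns; the gap is filled with 0.
wide = s.unstack(fill_value=0)
print(wide.loc["2020:12:13", "Sell"])  # 0, not NaN
```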
I created some toy data based on the little info you have provided; running it through the one-liner, this is what you get:
Type Buy Buy/Sell Sell
Time
2020:12:12 1000 0 1000
2020:12:13 4400 0 2200
2020:12:14 0 0 4680
2020:12:15 2860 0 1430
2020:12:16 0 6400 0
2020:12:17 7140 0 0
2020:12:18 0 770 0
2020:12:19 810 0 1620
2020:12:20 2400 0 1200
2020:12:21 1210 0 0
2020:12:22 1200 0 1200
It is basically almost identical to the original new_df you have computed, except that it keeps the Time values as the index and the Buy/Sell type is labeled Buy/Sell instead of NA. Of course you can drop the Time column and rename Buy/Sell if you so wish, by appending this to the one-liner:
.reset_index().drop("Time",axis=1).rename(columns={"Buy/Sell":"NA"})
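Putting the whole chain together on assumed toy rows (three rows standing in for the real gco.csv), this reproduces the shape of the loop-built new_df, with the Buy/Sell column renamed to NA:

```python
import pandas as pd

# Assumed toy data mirroring the answer's example.
data = pd.DataFrame({
    "Time":  ["2020:12:12", "2020:12:12", "2020:12:16"],
    "Type":  ["Buy", "Sell", "Buy/Sell"],
    "Money": [1000, 1000, 6400],
})

new_df = (
    data.groupby(["Time", "Type"])       # one group per (Time, Type) pair
        .agg({"Money": ["sum"]})         # sum Money within each group
        [("Money", "sum")]               # unpack to the raw sum column
        .unstack(fill_value=0)           # pivot Type into columns, 0 for gaps
        .reset_index()
        .drop("Time", axis=1)            # drop the date column
        .rename(columns={"Buy/Sell": "NA"})
)
print(list(new_df.columns))  # ['Buy', 'NA', 'Sell']
```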
Let me know in the comments if this provides any speedup. If your data frame is really large, you might have to resort to other techniques such as analyzing the data in batches, parallel processing, or custom numba processing.
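A hedged sketch of the batch idea, in case the file outgrows memory: pd.read_csv accepts a chunksize parameter, so you can aggregate each chunk and then combine the partial sums. The inline CSV here is assumed toy data standing in for gco.csv.

```python
import io
import pandas as pd

# Assumed toy CSV in place of the real file.
csv = io.StringIO(
    "Time,Type,Money\n"
    "2020:12:12,Buy,1000\n"
    "2020:12:12,Sell,1000\n"
    "2020:12:13,Buy,4400\n"
)

# Aggregate each chunk separately, then merge the partial sums.
partials = []
for chunk in pd.read_csv(csv, chunksize=2):
    partials.append(chunk.groupby(["Time", "Type"])["Money"].sum())
total = pd.concat(partials).groupby(level=[0, 1]).sum()
print(total[("2020:12:12", "Buy")])  # 1000
```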