Speeding up Python performance
Adding 100k rows of raw data and the plot drawn by the suggested code. I have the code below, analyzing 100K rows of data, and it takes 3 minutes for the output to be shown. The problem is with the for loops, where the program needs to check two indicators and then act based upon that. The data is a bourse record of buy/sell/NA trades, and I want to plot buy vs. sell and so on.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('gco.csv', encoding ='utf-16 LE')
x = data.index.size
data['Money'] = data['Last']*data['Volume']
data['Date'] = data['Time']
# Creating date column
data['Date'] = data['Date'].map(lambda x: x[0:10])
# Creating a dedicated database
my_df = pd.DataFrame(columns =['Buy','Sell','NA'])
#calculate the Buy column
avai_dates = pd.unique(data.Date)
y = len(avai_dates)
my_df = pd.DataFrame(index=np.arange(0, y), columns =['Buy','Sell','NA'])
my_df[:]=0
for j in range(y):
    for i in range(x):
        if data.Date[i] == avai_dates[j] and data.Type[i] == 'Buy':
            my_df.Buy[j] += data.Money[i]
    for i in range(x):
        if data.Date[i] == avai_dates[j] and data.Type[i] == 'Sell':
            my_df.Sell[j] += data.Money[i]
    for i in range(x):
        if data.Date[i] == avai_dates[j] and data.Type[i] == 'Buy/Sell':
            my_df.NA[j] += data.Money[i]
new_df = my_df[(my_df.T != 0).any()]
z = len(new_df)
xm = np.arange(0, z)
plt.plot(xm, new_df.Buy, 'green')
plt.plot(xm, new_df.Sell, 'red')
plt.plot(xm, new_df.NA, 'yellow')
plt.xlabel('Dates', fontsize = 15)
plt.ylabel('Money Volumes', fontsize = 15)
plt.title('Buy vs. Sell Vs. NA')
plt.grid()
plt.show()
ax = plt.subplot(111)
ax.bar(xm-0.2, new_df.Buy, width = 0.2, color = 'g')
ax.bar(xm,new_df.Sell, width = 0.2 , color = 'r')
ax.bar(xm+0.2,new_df.NA, width = 0.2, color = 'y')
What you are doing is grouping the data points by Time and Type and aggregating them. Pandas has built-in functions for doing this.
You can replace all this code:
# Creating a dedicated database
my_df = pd.DataFrame(columns =['Buy','Sell','NA'])
#calculate the Buy column
avai_dates = pd.unique(data.Date)
y = len(avai_dates)
my_df = pd.DataFrame(index=np.arange(0, y), columns =['Buy','Sell','NA'])
my_df[:]=0
for j in range(y):
    for i in range(x):
        if data.Date[i] == avai_dates[j] and data.Type[i] == 'Buy':
            my_df.Buy[j] += data.Money[i]
    for i in range(x):
        if data.Date[i] == avai_dates[j] and data.Type[i] == 'Sell':
            my_df.Sell[j] += data.Money[i]
    for i in range(x):
        if data.Date[i] == avai_dates[j] and data.Type[i] == 'Buy/Sell':
            my_df.NA[j] += data.Money[i]
new_df = my_df[(my_df.T != 0).any()]
With this statement:
new_df = data.groupby(["Time", "Type"]).agg({'Money':['sum']})["Money","sum"].unstack(fill_value=0)
data.groupby(["Time", "Type"])
This hierarchically groups the data by Time and then by Type. For more information on this, check out the DataFrame.groupby() documentation.
.agg({'Money':['sum']})
This aggregates the Money values in each group by summing them up. You could just use .agg('sum'), but that would also aggregate the values of 'Last' and 'Volume'.
["Money","sum"]
Then we just unpack the columns to get to the raw sum.然后我们只需解压缩列即可得到原始总和。 This gives you almost the result, however it has the Type
group stacked:这几乎为您提供了结果,但是它堆叠了Type
组:
Time Type
2020:12:12 Buy 1000
Sell 1000
2020:12:13 Buy 4400
Sell 2200
2020:12:14 Sell 4680
2020:12:15 Buy 2860
Sell 1430
2020:12:16 Buy/Sell 6400
2020:12:17 Buy 7140
2020:12:18 Buy/Sell 770
2020:12:19 Buy 810
Sell 1620
2020:12:20 Buy 2400
Sell 1200
2020:12:21 Buy 1210
2020:12:22 Buy 1200
Sell 1200
Name: (Money, sum), dtype: int64
.unstack(fill_value=0)
You can now use the final unstacking call to get the final dataframe. By setting fill_value=0 you ensure that the undefined values are set to 0 instead of NaN.
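Here is unstack with fill_value in isolation, on an assumed toy series with the same (Time, Type) MultiIndex shape as the stacked output above: the Type level pivots into columns, and missing (Time, Type) combinations become 0 rather than NaN.

```python
import pandas as pd

# Assumed toy values; (2020:12:13, Sell) is deliberately absent.
s = pd.Series(
    [1000, 1000, 4400],
    index=pd.MultiIndex.from_tuples(
        [("2020:12:12", "Buy"), ("2020:12:12", "Sell"), ("2020:12:13", "Buy")],
        names=["Time", "Type"],
    ),
)

# Innermost level (Type) becomes columns; the gap is filled with 0.
wide = s.unstack(fill_value=0)
print(wide.loc["2020:12:13", "Sell"])  # 0, not NaN
```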
I created some toy data based on the little info you have provided; running it through the one-liner, this is what you get:
Type Buy Buy/Sell Sell
Time
2020:12:12 1000 0 1000
2020:12:13 4400 0 2200
2020:12:14 0 0 4680
2020:12:15 2860 0 1430
2020:12:16 0 6400 0
2020:12:17 7140 0 0
2020:12:18 0 770 0
2020:12:19 810 0 1620
2020:12:20 2400 0 1200
2020:12:21 1210 0 0
2020:12:22 1200 0 1200
It is basically almost identical to the original new_df you have computed, except that it keeps the Time values as the index and the Buy/Sell type is labeled Buy/Sell instead of NA. Of course you can drop the Time column and rename Buy/Sell if you so wish, by appending this to the one-liner:
.reset_index().drop("Time",axis=1).rename(columns={"Buy/Sell":"NA"})
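Putting the whole chain together on assumed toy rows (three rows standing in for the real gco.csv), this reproduces the shape of the loop-built new_df, with the Buy/Sell column renamed to NA:

```python
import pandas as pd

# Assumed toy data mirroring the answer's example.
data = pd.DataFrame({
    "Time":  ["2020:12:12", "2020:12:12", "2020:12:16"],
    "Type":  ["Buy", "Sell", "Buy/Sell"],
    "Money": [1000, 1000, 6400],
})

new_df = (
    data.groupby(["Time", "Type"])       # one group per (Time, Type) pair
        .agg({"Money": ["sum"]})         # sum Money within each group
        [("Money", "sum")]               # unpack to the raw sum column
        .unstack(fill_value=0)           # pivot Type into columns, 0 for gaps
        .reset_index()
        .drop("Time", axis=1)            # drop the date column
        .rename(columns={"Buy/Sell": "NA"})
)
print(list(new_df.columns))  # ['Buy', 'NA', 'Sell']
```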
Let me know in the comments if this provides any speedup. If your data frame is really large, you might have to resort to other techniques such as analyzing the data in batches, parallel processing, or custom numba processing.
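A hedged sketch of the batch idea, in case the file outgrows memory: pd.read_csv accepts a chunksize parameter, so you can aggregate each chunk and then combine the partial sums. The inline CSV here is assumed toy data standing in for gco.csv.

```python
import io
import pandas as pd

# Assumed toy CSV in place of the real file.
csv = io.StringIO(
    "Time,Type,Money\n"
    "2020:12:12,Buy,1000\n"
    "2020:12:12,Sell,1000\n"
    "2020:12:13,Buy,4400\n"
)

# Aggregate each chunk separately, then merge the partial sums.
partials = []
for chunk in pd.read_csv(csv, chunksize=2):
    partials.append(chunk.groupby(["Time", "Type"])["Money"].sum())
total = pd.concat(partials).groupby(level=[0, 1]).sum()
print(total[("2020:12:12", "Buy")])  # 1000
```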