
(Python/numpy.where loop) Code takes a very long time (help wanted to speed up code)

The code below takes about 5 minutes to group roughly 110'000 items on the hardware it runs on. The for loop seems to be the cause. I would appreciate suggestions to speed this up.

import datetime as dt
import numpy as np

def group_data(x_epochs, y_data, grouping):
  x_texts = np.array([dt.datetime.fromtimestamp(i).strftime(grouping) for i in x_epochs], dtype='str')
  unique_x_texts = np.array(sorted(set(x_texts)), dtype='str')
  returned_y_data = np.zeros(np.shape(unique_x_texts))

  for ut in unique_x_texts:
    indices = np.where(x_texts == ut)[0]
    y = y_data[indices[-1]] - y_data[indices[0]]
    if y > 0:
      returned_y_data[np.where(unique_x_texts == ut)[0]] = y

  return unique_x_texts, returned_y_data

The data x_epochs is a linearly (monotonically) rising list of unix-epoch time values.
The y_data are totaliser-meter readings that typically rise over time, but at changing rates.
The grouping parameter is a strftime format string that specifies how the data should be grouped, so that a delta per group can be calculated. For example, to return a list of hourly deltas I would specify %Y %j %H.
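To make the grouping mechanism concrete, here is a minimal sketch of how strftime turns epoch values into group keys. The question groups in local time; UTC is pinned here (an assumption, purely so the output is reproducible), and the toy epochs are taken from the start of the x_epochs sample below.

```python
import datetime as dt

# Toy epochs: 15:30, 15:40 and 16:30 UTC on 2020-03-14. Two fall in the
# same hour, so they receive the same group key under '%d %Hh'.
epochs = [1584199800, 1584200400, 1584203400]
keys = [dt.datetime.fromtimestamp(e, tz=dt.timezone.utc).strftime('%d %Hh')
        for e in epochs]
print(keys)  # ['14 15h', '14 15h', '14 16h']
```

Every epoch mapping to the same key ends up in the same group, so the choice of format string directly controls the grouping granularity.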

Are there any options to improve and speed this code up?

EDIT:
Updated code, with thanks to the comment by @goodvibration. Using enumerate eliminates one call to np.where:

def group_data(x_epochs, y_data, grouping):
  x_texts = np.array([dt.datetime.fromtimestamp(i).strftime(grouping) for i in x_epochs], dtype='str')
  unique_x_texts = np.array(sorted(set(x_texts)), dtype='str')
  returned_y_data = np.zeros(np.shape(unique_x_texts))

  for idx, ut in enumerate(unique_x_texts):
    indices = np.where(x_texts == ut)[0]
    y = y_data[indices[-1]] - y_data[indices[0]]
    if y > 0:
      returned_y_data[idx] = y

  return unique_x_texts, returned_y_data

Unfortunately, the gain in throughput was not very dramatic.
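For what it's worth, the remaining cost is the per-group np.where scan, which can be avoided with a single pass that records the first and last index of each key in plain dicts. A minimal sketch on toy stand-ins for x_texts and y_data (the names first and last are illustrative, not from the original code):

```python
# Toy stand-ins for x_texts (already in time order) and y_data.
x_texts = ['a', 'a', 'b', 'b', 'c']
y_data = [1.0, 4.0, 4.0, 9.0, 8.0]

first, last = {}, {}
for i, key in enumerate(x_texts):
    first.setdefault(key, i)  # remember only the first index seen per key
    last[key] = i             # keep overwriting: ends up as the last index

# Delta per group, clamped to zero as in the original function.
deltas = {k: max(y_data[last[k]] - y_data[first[k]], 0.0) for k in first}
print(deltas)  # {'a': 3.0, 'b': 5.0, 'c': 0.0}
```

This traverses the data once instead of once per group, which is the same idea the accepted approach below expresses with numpy primitives.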

EDIT: example of x_epochs:

[1584199800 1584200400 1584201000 1584201600 1584202200 1584202800
 1584203400 1584204000 1584204600 1584205200 1584205800 1584206400
 1584207000 1584207600 1584208200 1584208800 1584209400 1584210000
 1584210600 1584211200 1584211800 1584212400 1584213000 1584213600
 1584214200 1584214800 1584215400 1584216000 1584216600 1584217200
 1584217800 1584218400 1584219000 1584219600 1584220200 1584220800
 1584221400 1584222000 1584222600 1584223200 1584223800 1584224400
 1584225000 1584225600 1584226200 1584226800 1584227400 1584228000
 1584228600 1584229200 1584229800 1584230400 1584231000 1584231600
 1584232200 1584232800 1584233400 1584234000 1584234600 1584235200
 1584235800 1584236400 1584237000 1584237600 1584238200 1584238800
 1584239400 1584240000 1584240600 1584241200 1584241800 1584242400
 1584243000 1584243600 1584244200 1584244800 1584245400 1584246000
 1584246600 1584247200 1584247800 1584248400 1584249000 1584249600
 1584250200 1584250800 1584251400 1584252000 1584252600 1584253200
 1584253800 1584254400 1584255000 1584255600 1584256200 1584256800
 1584257400 1584258000 1584258600 1584259200 1584259800 1584260400
 1584261000 1584261600 1584262200 1584262800 1584263400 1584264000
 1584264600 1584265200 1584265800 1584266400 1584267000 1584267600
 1584268200 1584268800 1584269400 1584270000 1584270600 1584271200
 1584271800 1584272400 1584273000 1584273600 1584274200 1584274800
 1584275400 1584276000 1584276600 1584277200 1584277800 1584278400
 1584279000 1584279600 1584280200 1584280800 1584281400 1584282000
 1584282600 1584283200 1584283800 1584284400 1584285000 1584285600
 1584286200 1584286800 1584287400 1584288000 1584288600 1584289200
 1584289800 1584290400 1584291000 1584291600 1584292200 1584292800
 1584293400 1584294000 1584294600 1584295200 1584295800 1584296400
 1584297000 1584297600 1584298200 1584298800 1584299400 1584300000
 1584300600 1584301200 1584301800 1584302400 1584303000 1584303600
 1584304200 1584304800 1584305400 1584306000 1584306600 1584307200
 1584307800 1584308400 1584309000 1584309600 1584310200 1584310800
 1584311400 1584312000 1584312600 1584313200 1584313800 1584314400
 1584315000 1584315600 1584316200 1584316800 1584317400 1584318000
 1584318600 1584319200 1584319800 1584320400 1584321000 1584321600
 1584322200 1584322800 1584323400 1584324000 1584324600 1584325200
 1584325800 1584326400 1584327000 1584327600 1584328200 1584328800
 1584329400 1584330000 1584330600 1584331200 1584331800 1584332400
 1584333000 1584333600 1584334200 1584334800 1584335400 1584336000
 1584336600 1584337200 1584337800 1584338400 1584339000 1584339600
 1584340200 1584340800 1584341400 1584342000 1584342600 1584343200
 1584343800 1584344400 1584345000 1584345600 1584346200 1584346800
 1584347400 1584348000 1584348600 1584349200 1584349800 1584350400
 1584351000 1584351600 1584352200 1584352800 1584353400 1584354000
 1584354600 1584355200 1584355800 1584356400 1584357000 1584357600
 1584358200 1584358800 1584359400 1584360000 1584360600 1584361200
 1584361800 1584362400 1584363000 1584363600 1584364200 1584364800
 1584365400 1584366000 1584366600 1584367200 1584367800 1584368400
 1584369000 1584369600 1584370200 1584370800 1584371400 1584372000
 1584372600 1584373200 1584373800 1584374400 1584375000 1584375600
 1584376200 1584376800 1584377400 1584378000 1584378600 1584379200
 1584379800]

example of y_data:

[54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.         54990.
 54990.         54990.         54990.9718543  55013.61214165
 55046.23509934 55092.59933775 55120.49915683 55232.16887417
 55396.74668874 55587.17537943 55794.46357616 56039.78807947
 56228.94435076 56341.84768212 56392.19055649 56484.82119205
 56505.43377483 56509.92580101 56544.1307947  56624.4205298
 56788.66104553 56901.03311258 56986.18543046 57053.55311973
 57106.50827815 57228.92580101 57307.19205298 57373.13930348
 57426.03703704 57505.9884106  57587.55223881 57766.36363636
 57932.57615894 58124.88723051 58207.29292929 58294.66500829
 58392.36700337 58498.51986755 58501.         58653.12962963
 58951.96517413 59136.54635762 59255.65656566 59324.30845771
 59326.         59346.57504216 59400.54304636 59470.82154882
 59530.70646766 59575.73344371 59609.38888889 59645.1641791
 59678.98675497 59705.51770658 59733.30463576 59747.55960265
 59775.43338954 59783.78807947 59784.         59784.
 59784.         59784.         59784.         59784.97019868
 59785.         59785.         59785.         59785.
 59785.        ]

Using grouping = '%d %Hh' results in unique_x_texts:

['14 16h' '14 17h' '14 18h' '14 19h' '14 20h' '14 21h' '14 22h' '14 23h'
 '15 00h' '15 01h' '15 02h' '15 03h' '15 04h' '15 05h' '15 06h' '15 07h'
 '15 08h' '15 09h' '15 10h' '15 11h' '15 12h' '15 13h' '15 14h' '15 15h'
 '15 16h' '15 17h' '15 18h' '15 19h' '15 20h' '15 21h' '15 22h' '15 23h'
 '16 00h' '16 01h' '16 02h' '16 03h' '16 04h' '16 05h' '16 06h' '16 07h'
 '16 08h' '16 09h' '16 10h' '16 11h' '16 12h' '16 13h' '16 14h' '16 15h'
 '16 16h' '16 17h' '16 18h']

and returned_y_data:

[  0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.          56.23509934
 701.86423841 465.64569536 476.25962945 372.48391731 701.3045187
 657.30016584 263.99668874 208.16520615  78.48229342   1.
   0.        ]

Tested on your data. Try this:

import datetime as dt
import numpy as np

def fast_groupie(x_epochs, y_data, grouping):
  y_data = np.array(y_data)
  x_texts = np.array([dt.datetime.fromtimestamp(i).strftime(grouping) for i in x_epochs], dtype='str')

  # First occurrence of each sorted unique label.
  unique_x_texts, loc1 = np.unique(x_texts, return_index=True)
  # First occurrence in the reversed array, mapped back = last occurrence.
  loc2 = len(x_texts) - 1 - np.unique(np.flip(x_texts), return_index=True)[1]

  y = y_data[loc2] - y_data[loc1]
  returned_y_data = np.where(y > 0, y, 0)

  return unique_x_texts, returned_y_data

The function you provided finds unique_x_texts first, which is one full traversal of x_texts. Then, if there are M items in unique_x_texts, it calls np.where M times, which is another M traversals of x_texts. The more unique items there are, the longer this takes.

The function above goes through x_texts only twice; it is independent of M, and should thus be quite a bit faster.
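To make the flip trick concrete, here is a toy sketch of how loc1 and loc2 pick out the first and last index of each group (the labels are assumed to be in time order, as in the question):

```python
import numpy as np

x_texts = np.array(['a', 'a', 'b', 'b', 'b', 'c'])

# np.unique returns sorted unique values and the index of each value's
# first occurrence.
uniq, loc1 = np.unique(x_texts, return_index=True)
# First occurrence in the reversed array = last occurrence in the original,
# after mapping the reversed index back with len - 1 - i.
loc2 = len(x_texts) - 1 - np.unique(x_texts[::-1], return_index=True)[1]

print(uniq.tolist())  # ['a', 'b', 'c']
print(loc1.tolist())  # [0, 2, 5]
print(loc2.tolist())  # [1, 4, 5]
```

Indexing y_data with loc1 and loc2 then yields each group's first and last reading in one vectorized step, with no per-group scan.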

Above all, try to avoid for loops, and replace them with vectorized functions or other map-like functions. In other words, perform operations on vectors and matrices (arrays). @Mercury provided a good illustration of this. Numpy, as well as Scipy and other Python libraries for ML, have functions that are optimized and written precisely for huge datasets. Use them.
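As one more illustration of this map-like style (assuming pandas is available; it is not used in the question, and the toy keys and readings here are made up), the first/last delta per group can be written as a single groupby:

```python
import pandas as pd

# Toy group keys and totaliser readings, standing in for x_texts / y_data.
keys = ['14 16h', '14 16h', '14 17h', '14 17h', '14 18h']
y = pd.Series([100.0, 103.0, 103.0, 110.0, 108.0], index=keys)

g = y.groupby(level=0)                        # one group per key, sorted
delta = (g.last() - g.first()).clip(lower=0)  # clamp negative deltas to 0
print(delta.tolist())  # [3.0, 7.0, 0.0]
```

The loop over groups disappears entirely; pandas performs the grouping and reductions in optimized compiled code.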
