Run a for loop concurrently and not sequentially in pyspark

Below is a for loop that I am running on a Databricks cluster:

import pandas as pd

datalake_spark_dataframe_downsampled = pd.DataFrame(
    {'IMEI': ['001', '001', '001', '001', '001', '002', '002'],
     'OuterSensorConnected': [0, 0, 0, 1, 0, 0, 0],
     'OuterHumidity': [31.784826, 32.784826, 33.784826, 43.784826, 23.784826, 54.784826, 31.784826],
     'EnergyConsumption': [70, 70, 70, 70, 70, 70, 70],
     'DaysDeploymentDate': [0, 0, 1, 1, 1, 1, 1],
     'label': [0, 0, 1, 1, 0, 0, 0]}  # all columns need 7 values; the original 'label' list was one short
)
datalake_spark_dataframe_downsampled = spark.createDataFrame(datalake_spark_dataframe_downsampled)

# printSchema of the datalake_spark_dataframe_downsampled (spark df):

"root
 |-- IMEI: string (nullable = true)
 |-- OuterSensorConnected: integer (nullable = false)
 |-- OuterHumidity: float (nullable = true)
 |-- EnergyConsumption: float (nullable = true)
 |-- DaysDeploymentDate: integer (nullable = true)
 |-- label: integer (nullable = false)"

from pyspark.sql import functions as sql_function

device_ids = datalake_spark_dataframe_downsampled.select(sql_function.collect_set('IMEI').alias('unique_IMEIS')).collect()[0]['unique_IMEIS']

print(device_ids)  # ["001", "002", ..., "030"], 30 unique ids

for i in device_ids:

  #filtered_dataset=datalake_spark_dataframe_downsampled.where(datalake_spark_dataframe_downsampled.IMEI.isin([i])) 
  #The above operation is executed inside the function training_models_operation_testing()

  try:
      training_models_operation_testing(i, datalake_spark_dataframe_downsampled, drop_columns_not_used_in_training,
                                        training_split_ratio_value, testing_split_ratio_value, mlflow_folder,
                                        cross_validation_rounds_value, features_column_name,
                                        optimization_metric_value, pretrained_models_T_minus_one,
                                        folder_name_T_minus_one, timestamp_snap, instrumentation_key_value,
                                        canditate_asset_ids, executor, device_ids)

  except Exception as e:
      custom_logging_function("ERROR", instrumentation_key_value, "ERROR EXCEPTION: {0}".format(e))

For the sake of the problem, I have attached sample data to give a general idea of what my data looks like. Imagine that many more rows and IDs exist; I created just a few for demonstration.

As you can see, this is a simple function call inside a for loop on a Databricks cluster running pyspark.

Briefly, I first create a list of the unique IDs (the IMEI column) that exist in my dataset. There are 30 of them, so the for loop runs 30 iterations. In each iteration I execute the following steps:

  • Filter the rows of datalake_spark_dataframe_downsampled (the Spark df) matching each of the 30 asset IDs. For example, out of the 40,000 rows of the initial df, only 140 may correspond to the first device ID.
  • Based on those 140 rows (filtered_dataset), the function does preprocessing and a train-test split, and trains two Spark ML algorithms only on the rows of the filtered dataset (a rough skeleton of these two steps is sketched right after this list).
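
The body of training_models_operation_testing is not shown here; purely for illustration, a hypothetical skeleton of those two steps could look like the sketch below (the preprocessing, the feature columns and the two estimators are placeholders, not the real code):

# Hypothetical skeleton of the per-id work (illustration only; the real function
# takes many more parameters and does more preprocessing)
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

def training_models_operation_sketch(device_id, df):
    # Step 1: keep only the rows belonging to this device id
    filtered_dataset = df.where(df.IMEI.isin([device_id]))

    # Step 2: minimal preprocessing -- assemble the feature columns into a vector
    assembler = VectorAssembler(
        inputCols=['OuterSensorConnected', 'OuterHumidity',
                   'EnergyConsumption', 'DaysDeploymentDate'],
        outputCol='features')
    assembled = assembler.transform(filtered_dataset)

    # Step 3: train-test split, then train two Spark ML algorithms on the filtered rows
    train_df, test_df = assembled.randomSplit([0.8, 0.2], seed=42)
    model_1 = LogisticRegression(featuresCol='features', labelCol='label').fit(train_df)
    model_2 = RandomForestClassifier(featuresCol='features', labelCol='label').fit(train_df)
    return model_1, model_2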

The attached code snippet works successfully. However, the for loop executes sequentially, one iteration at a time: the function is called for the first ID, and only after it completes does it move on to the next ID. What I want is to transform the above for loop so that the 30 iterations run concurrently in pyspark and NOT one-by-one. How could I achieve this in pyspark?

I am open to discussion and to testing ideas, because I understand that what I am asking may not be so simple to execute in a Spark environment.

My current output in logging (I print it roughly as shown below):

Iteration 1
Starting execution...
- Executing the function for id 001
Finished execution...

Iteration 2
Starting execution...
- Executing the function for id 002
Finished execution...

My desired output in logging (I print it roughly as shown below):

Starting execution...
- Executing the function for id 001
- Executing the function for id 002
- Executing the function for id 003
- Executing the function for id 004

. . . .
- Executing the function for id 030
Finished execution...

All at the same time (concurrently), at once.

[Update] Based on the answer in the comments (the threading module).


"for loop" is linear execution/ Sequential execution and can be considered as single threaded execution. “for循环”是线性执行/顺序执行,可以认为是单线程执行。

If you want to run your code concurrently, you need to create multiple threads/processes to execute it.

Below is an example that achieves multithreading. I didn't test the code, but it should work :)

# Import the threading library
import threading

# Create a list to hold one thread per device id
thread_list = []

# Loop over all device ids, create a thread for each one, and append it to thread_list
for item in device_ids:
    thread = threading.Thread(target=training_models_operation_testing,
                              args=(item, datalake_spark_dataframe_downsampled, drop_columns_not_used_in_training,
                                    training_split_ratio_value, testing_split_ratio_value, mlflow_folder,
                                    cross_validation_rounds_value, features_column_name,
                                    optimization_metric_value, pretrained_models_T_minus_one,
                                    folder_name_T_minus_one, timestamp_snap, instrumentation_key_value,
                                    canditate_asset_ids, executor, device_ids))
    thread_list.append(thread)

# Start multithreaded execution
for thread in thread_list:
    thread.start()

# Wait for all threads to finish
for thread in thread_list:
    thread.join()

print("Finished executing all threads")
