pyspark memory-efficient loop to add indicator columns to dataframe

Is there a way to convert the following into code that takes advantage of pyspark parallelization in the for loop?

import pyspark.sql.functions as F

# Collect the distinct integer values to the driver as a Python list
my_list_of_integers = list(df_column_of_integers.select('column_name').toPandas()['column_name'])

# For each integer: filter large_df1 to that value, keep the keys, add a
# constant indicator column, and left-join it onto large_df2
for my_int in my_list_of_integers:
    temp_df = large_df1.filter(large_df1.a_value_column == my_int)
    temp_df = temp_df.select("a_key_column")
    temp_df = temp_df.withColumn("indicator" + str(my_int), F.lit(1))
    large_df2 = large_df2.join(temp_df, on="a_key_column", how="left")

After going through the for loop 7 times (the goal was 185), the code fails and gives this error message:

org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 52 bytes of memory, got 0

An additional error message reported by the system I am working in suggests how to resolve the issue:

Your job has exceeded the memory overhead. Your code might be attempting to run wholly on one executor, which can happen if you haven't used pySpark; for instance if you're using Pandas or UDFs.

For a simple, working example, here is sample input and a visualization of the expected output from this sample input:

Sample input:

from pyspark.sql.types import IntegerType

# The integer values that should become indicator columns
df_column_of_integers = spark.createDataFrame([10, 11, 13], IntegerType()).toDF('column_name')

# Key/value pairs; the value column is matched against the integers above
df1_key_column = [52, 52, 53, 53, 52, 52]
a_value_column = [9, 13, 10, 11, 12, 10]
large_df1 = spark.createDataFrame(zip(df1_key_column, a_value_column), schema=['a_key_column', 'a_value_column'])

# The dataframe that should receive the indicator columns
large_df2 = spark.createDataFrame([52, 54, 53], IntegerType()).toDF('a_key_column')

Expected output (i.e., the final version of large_df2 for the simple example above):

+--------------+-------------+-------------+-------------+                                                              
|  a_key_column|  indicator10|  indicator11|  indicator13|
+--------------+-------------+-------------+-------------+
|            52|            1|         NULL|            1|
|            54|         NULL|         NULL|         NULL|
|            53|            1|            1|         NULL|
+--------------+-------------+-------------+-------------+

In actuality, my df_column_of_integers has 185 entries. large_df1 has 82 million rows and 2 columns before it is filtered in the first step of the for loop, and at most 0.9 million rows after that filter. large_df2 starts with 0.9 million rows and 33 columns (23 of which are integers). From the detailed error message, it seems the error is occurring during the join. However, I have joined larger datasets on this system in the past, just not in a for loop over a Pandas list, so it makes me think the source of the issue is the use of a Pandas list, which prompts the use of a single executor. Thus, I am thinking that there might be a better loop technique that someone might know.
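(As an aside, the list of integers itself can be built without going through Pandas; a minimal sketch, assuming the same df_column_of_integers as above, is to collect the column directly on the driver. This was not the eventual fix, just an alternative to the Pandas round-trip.)

# Build the driver-side list with collect() instead of toPandas()
my_list_of_integers = [row['column_name'] for row in df_column_of_integers.select('column_name').collect()]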

I tried using .foreach with a lambda function, as described here: https://sparkbyexamples.com/pyspark/pyspark-loop-iterate-through-rows-in-dataframe/ , but I cannot figure out how to add large_df1 and large_df2 as additional inputs to the lambda function. And I don't think .map would be helpful, because I don't want to edit my_list_of_integers, only iterate over its values.

Thank you in advance!

I solved my problem: I replaced everything within the for loop with:

import pyspark.sql.functions as F

# Keep only the rows whose value is one of the integers of interest
large_df1 = large_df1.filter(large_df1.a_value_column.isin(my_list_of_integers))
# Constant 1 that becomes the indicator value after the pivot
large_df1 = large_df1.withColumn("my_value", F.lit(1))
# Pivot: one column per integer value, keyed by a_key_column
large_df1 = large_df1.groupBy("a_key_column").pivot("a_value_column", my_list_of_integers).agg(F.first(F.col("my_value")))
# Attach the indicator columns to large_df2
large_df2 = large_df2.join(large_df1, on="a_key_column", how="left")

The key step is the use of pivot. The new code runs easily and quickly.
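For completeness, here is a minimal end-to-end sketch of this pivot approach applied to the sample input above. Note that pivot names the new columns after the pivot values (e.g. 10 rather than indicator10); the rename step below is an illustrative extra, not part of the original answer.

import pyspark.sql.functions as F

my_list_of_integers = [10, 11, 13]

# Filter to the values of interest, add a constant, and pivot to one column per value
pivoted = (large_df1
           .filter(large_df1.a_value_column.isin(my_list_of_integers))
           .withColumn("my_value", F.lit(1))
           .groupBy("a_key_column")
           .pivot("a_value_column", my_list_of_integers)
           .agg(F.first(F.col("my_value"))))

# Illustrative extra: rename the pivoted columns to match the indicator naming above
for my_int in my_list_of_integers:
    pivoted = pivoted.withColumnRenamed(str(my_int), "indicator" + str(my_int))

large_df2.join(pivoted, on="a_key_column", how="left").show()
# Should match the expected output table above, e.g. key 52 -> 1, NULL, 1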
