Is there a way to convert the following into code that takes advantage of PySpark parallelization in the for loop?
import pyspark.sql.functions as F

my_list_of_integers = list(df_column_of_integers.select('column_name').toPandas()['column_name'])
for my_int in my_list_of_integers:
    temp_df = large_df1.filter(large_df1.a_value_column == my_int)
    temp_df = temp_df.select("a_key_column")
    temp_df = temp_df.withColumn("indicator" + str(my_int), F.lit(1))
    large_df2 = large_df2.join(temp_df, on="a_key_column", how="left")
After going through the for loop 7 times (the goal was 185), the code fails and gives this error message:
org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 52 bytes of memory, got 0
An additional error message reported by the system I am working in suggests how to resolve the issue:
Your job has exceeded the memory overhead. Your code might be attempting to run wholly on one executor, which can happen if you haven't used pySpark; for instance if you're using Pandas or UDFs.
For a simple, working example, here is sample input and a visualization of the expected output:
Sample input:
from pyspark.sql.types import IntegerType

df_column_of_integers = spark.createDataFrame([10, 11, 13], IntegerType()).toDF('column_name')
df1_key_column = [52, 52, 53, 53, 52, 52]
a_value_column = [9, 13, 10, 11, 12, 10]
large_df1 = spark.createDataFrame(list(zip(df1_key_column, a_value_column)), schema=['a_key_column', 'a_value_column'])
large_df2 = spark.createDataFrame([52, 54, 53], IntegerType()).toDF('a_key_column')
Expected output (i.e., the final version of large_df2 for the simple example above):
+--------------+-------------+-------------+-------------+
| a_key_column| indicator10| indicator11| indicator13|
+--------------+-------------+-------------+-------------+
| 52| 1| NULL| 1|
| 54| NULL| NULL| NULL|
| 53| 1| 1| NULL|
+--------------+-------------+-------------+-------------+
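For readers who want to sanity-check the expected output without a Spark cluster, the same loop logic can be reproduced in plain pandas. This is my own local cross-check of the example above, not part of the original question; `merge(..., how="left")` plays the role of the Spark left join:

```python
import pandas as pd

# same sample data as the Spark example above
large_df1 = pd.DataFrame({
    "a_key_column": [52, 52, 53, 53, 52, 52],
    "a_value_column": [9, 13, 10, 11, 12, 10],
})
large_df2 = pd.DataFrame({"a_key_column": [52, 54, 53]})
my_list_of_integers = [10, 11, 13]

for my_int in my_list_of_integers:
    # filter to rows matching this value, keep only the key column
    temp = large_df1[large_df1["a_value_column"] == my_int][["a_key_column"]].drop_duplicates()
    # add the constant indicator column, then left-join onto large_df2
    temp["indicator" + str(my_int)] = 1
    large_df2 = large_df2.merge(temp, on="a_key_column", how="left")

print(large_df2)
```

Keys with no match for a given value come through the left join as NaN, matching the NULLs in the table above.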
In actuality, my df_column_of_integers has 185 entries. large_df1 has 82 million rows and 2 columns before it is filtered in the first step of the for loop, and at most 0.9 million rows after that filter. large_df2 starts with 0.9 million rows and 33 columns (23 of which are integers). From the detailed error message, it seems the error occurs during the join. However, I have joined larger datasets on this system in the past, just not in a for loop over a Pandas list, which makes me think the source of the issue is the use of a Pandas list, prompting the use of a single executor. Thus, I am thinking that there might be a better loop technique that someone might know.
I tried using .foreach with a lambda function, as described here: https://sparkbyexamples.com/pyspark/pyspark-loop-iterate-through-rows-in-dataframe/ , but I cannot figure out how to add large_df1 and large_df2 as additional inputs to the lambda function. And I don't think .map would be helpful, because I don't want to edit my_list_of_integers, only iterate over its values.
Thank you in advance!
I solved my problem: I replaced everything within the for loop with:
import pyspark.sql.functions as F

large_df1 = large_df1.filter(large_df1.a_value_column.isin(my_list_of_integers))
large_df1 = large_df1.withColumn("my_value", F.lit(1))
large_df1 = large_df1.groupBy("a_key_column").pivot("a_value_column", my_list_of_integers).agg(F.first(F.col("my_value")))
large_df2 = large_df2.join(large_df1, on="a_key_column", how="left")
The key step is the use of pivot. The new code runs easily and quickly.
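Note that Spark's pivot names the new columns after the pivoted values ("10", "11", "13") rather than "indicator10" etc., so a rename step is needed to match the original output exactly. Here is the same filter-then-pivot idea translated into pandas (my own illustration for a quick local check, using `pivot_table` in place of Spark's `groupBy(...).pivot(...)`), including the rename:

```python
import pandas as pd

# same sample data as in the question
large_df1 = pd.DataFrame({
    "a_key_column": [52, 52, 53, 53, 52, 52],
    "a_value_column": [9, 13, 10, 11, 12, 10],
})
large_df2 = pd.DataFrame({"a_key_column": [52, 54, 53]})
my_list_of_integers = [10, 11, 13]

# filter to the values of interest, tag each row with a constant 1
filtered = large_df1[large_df1["a_value_column"].isin(my_list_of_integers)].copy()
filtered["my_value"] = 1

# one pivot replaces all 185 filter/join iterations:
# each distinct a_value_column becomes its own indicator column
pivoted = filtered.pivot_table(index="a_key_column",
                               columns="a_value_column",
                               values="my_value",
                               aggfunc="first").reset_index()

# add the "indicator" prefix used by the original loop's column names
pivoted.columns = ["a_key_column"] + ["indicator" + str(c) for c in pivoted.columns[1:]]

# single left join instead of one join per value
large_df2 = large_df2.merge(pivoted, on="a_key_column", how="left")
print(large_df2)
```

In Spark the same rename could be done with `withColumnRenamed` after the pivot. The performance win comes from doing one shuffle-heavy pivot and one join, instead of 185 separate filters and joins whose query plan grows with every iteration.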