
Getting empty dataframe after foreachPartition execution in Pyspark

I'm fairly new to PySpark and I'm trying to run a foreachPartition function on my dataframe and then perform another operation on the same dataframe. The problem is that after calling foreachPartition, my dataframe ends up empty, so I cannot do anything else with it. My code looks like the following:

def my_random_function(partition, parameters):
    # 'partition' is an iterator over the Row objects of one partition
    # performs something with the rows; does not return anything
    pass

my_py_spark_dataframe.foreachPartition(
    lambda partition: my_random_function(partition, parameters))
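To make the pattern concrete, a stripped-down, runnable version of what I am doing looks like this (the data, the parameters value, and the function body here are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder data and parameters, only to make the pattern concrete
my_py_spark_dataframe = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
parameters = {"prefix": ">>"}

def my_random_function(partition, parameters):
    # 'partition' yields the Row objects of one partition
    for row in partition:
        print(f"{parameters['prefix']} {row.id} {row.value}")

# foreachPartition is an action: it runs the function and returns None
my_py_spark_dataframe.foreachPartition(
    lambda partition: my_random_function(partition, parameters))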

Could someone tell me how I can run this foreachPartition and still use the same dataframe for other operations?

I saw some users suggesting copying the dataframe with df.toPandas().copy(), but in my case this causes performance issues, so I would like to keep working with the same dataframe instead of creating a new one.

Thank you in advance!

It is not clear which operation you are trying to perform, but here is a sample usage of foreachPartition:

The sample data is a list of countries from three continents:

+---------+-------+
|Continent|Country|
+---------+-------+
|       NA|    USA|
|       NA| Canada|
|       NA| Mexico|
|       EU|England|
|       EU| France|
|       EU|Germany|
|     ASIA|  India|
|     ASIA|  China|
|     ASIA|  Japan|
+---------+-------+

The following code repartitions the data by "Continent", iterates over each partition using foreachPartition, and appends each "Country" name to a file named after that partition's continent. (The function passed to foreachPartition runs on the executors, so writing to a local path like this only makes sense in local mode, e.g. on a single machine or in Colab.)

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = [["NA", "USA"], ["NA", "Canada"], ["NA", "Mexico"],
        ["EU", "England"], ["EU", "France"], ["EU", "Germany"],
        ["ASIA", "India"], ["ASIA", "China"], ["ASIA", "Japan"]]
df = spark.createDataFrame(data=data, schema=["Continent", "Country"])
df.withColumn("partition_id", F.spark_partition_id()).show()

# Bring all rows of a continent into the same partition
df = df.repartition(F.col("Continent"))
df.withColumn("partition_id", F.spark_partition_id()).show()

def write_to_file(rows):
    # 'rows' is an iterator over the Row objects of one partition
    for row in rows:
        with open(f"/content/sample_data/{row.Continent}.txt", "a+") as f:
            f.write(f"{row.Country}\n")

df.foreachPartition(write_to_file)

Output:

Three files: one for each partition.

!ls -1 /content/sample_data/

ASIA.txt
EU.txt
NA.txt

Each file has country names for that continent (partition):

!cat /content/sample_data/ASIA.txt
India
China
Japan

!cat /content/sample_data/EU.txt
England
France
Germany

!cat /content/sample_data/NA.txt
USA
Canada
Mexico
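
Regarding the original concern: foreachPartition is an action that returns None and does not modify or consume the DataFrame, so the same df can still be used for further operations afterwards. A minimal sketch (the cache() call is optional and only avoids recomputing the DataFrame for later actions):

df.cache()  # optional: keep the data around for the actions that follow
df.foreachPartition(write_to_file)

# The same DataFrame is still fully usable after the action
print(df.count())  # 9
df.show()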
