I'm fairly new to PySpark. I'm trying to run foreachPartition on my dataframe and then run another function on the same dataframe. The problem is that after calling foreachPartition my dataframe appears to be empty, so I cannot do anything else with it. My code looks like this:
def my_random_function(partition, parameters):
    # performs something with the dataframe
    # does not return anything
    pass

my_py_spark_dataframe.foreachPartition(
    lambda partition: my_random_function(partition, parameters))
Could someone tell me how I can run this foreachPartition and still use the same dataframe for other operations?
I saw some users suggest copying the dataframe with df.toPandas().copy(), but in my case this causes performance issues, so I would like to keep using the same dataframe instead of creating a new one.
Thank you in advance!
It is not clear which operation you are attempting, but here is a sample usage of foreachPartition:
The sample data is a list of countries from three continents:
+---------+-------+
|Continent|Country|
+---------+-------+
|       NA|    USA|
|       NA| Canada|
|       NA| Mexico|
|       EU|England|
|       EU| France|
|       EU|Germany|
|     ASIA|  India|
|     ASIA|  China|
|     ASIA|  Japan|
+---------+-------+
The following code repartitions the data by "Continent", iterates over each partition using foreachPartition,
and appends each "Country" name to the file for that partition, i.e. one file per continent.
from pyspark.sql import functions as F

# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame(
    data=[["NA", "USA"], ["NA", "Canada"], ["NA", "Mexico"], ["EU", "England"], ["EU", "France"], ["EU", "Germany"], ["ASIA", "India"], ["ASIA", "China"], ["ASIA", "Japan"]],
    schema=["Continent", "Country"])
df.withColumn("partition_id", F.spark_partition_id()).show()
# Rows with the same Continent end up in the same partition.
df = df.repartition(F.col("Continent"))
df.withColumn("partition_id", F.spark_partition_id()).show()

def write_to_file(rows):
    # rows is an iterator over the Row objects of one partition
    for row in rows:
        with open(f"/content/sample_data/{row.Continent}.txt", "a+") as f:
            f.write(f"{row.Country}\n")

df.foreachPartition(write_to_file)
Output:
Three files: one for each partition.
!ls -1 /content/sample_data/
ASIA.txt
EU.txt
NA.txt
Each file has country names for that continent (partition):
!cat /content/sample_data/ASIA.txt
India
China
Japan
!cat /content/sample_data/EU.txt
England
France
Germany
!cat /content/sample_data/NA.txt
USA
Canada
Mexico
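As for the "empty" data in the question: the question does not show what my_random_function does, so this is only a guess. The argument that foreachPartition passes to your function is a single-pass iterator over the rows of one partition, so a second loop over it inside the same function sees nothing. The sketch below reproduces that behaviour in plain Python without Spark. The Row class and both function names are hypothetical stand-ins made up for the illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Row(NamedTuple):
    # Hypothetical stand-in for pyspark.sql.Row so the sketch runs without Spark.
    Continent: str
    Country: str


def count_then_collect(rows: Iterator[Row]) -> Tuple[int, List[str]]:
    # The first pass consumes the iterator...
    count = sum(1 for _ in rows)
    # ...so a second pass over the same iterator yields nothing.
    leftover = [row.Country for row in rows]
    return count, leftover


def collect_then_count(rows: Iterable[Row]) -> Tuple[int, List[str]]:
    # Materialize the partition once, then reuse the list as often as needed.
    materialized = list(rows)
    return len(materialized), [row.Country for row in materialized]


partition = [Row("EU", "England"), Row("EU", "France"), Row("EU", "Germany")]

print(count_then_collect(iter(partition)))  # (3, [])
print(collect_then_count(iter(partition)))  # (3, ['England', 'France', 'Germany'])
```

The dataframe itself is not changed by foreachPartition: Spark dataframes are immutable, so calling another action on the same df afterwards (for example df.count()) should still see all rows. If the function needs more than one pass over a partition, turning the iterator into a list as above is one option, at the cost of holding that partition in executor memory.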