
Join DataFrames by iterating through a dictionary

import os
import glob
from pyspark.sql import functions as F

# get_csv_files returns a dictionary keyed by CSV file path
result = get_csv_files("/test")

d = {}
i = 1
# loop over the CSV files and read each one into a DataFrame
for k, v in result.items():
    df = spark.read.option("header", True).csv(k)
    d[i] = df
    i = i + 1

I can read all the CSV files in a folder using the get_csv_files function and build a dictionary of key (filename) and value (DataFrame) pairs.
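The get_csv_files helper is not shown in the question; a minimal sketch of what it might look like, assuming it simply globs a directory for *.csv files and returns a dictionary keyed by file path (the name, return shape, and /test folder are assumptions for illustration):

def get_csv_files(folder):
    # Hypothetical helper (not shown in the question): collect all CSV
    # paths under `folder` and return them as {path: filename} pairs.
    paths = glob.glob(os.path.join(folder, "*.csv"))
    return {p: os.path.basename(p) for p in paths}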

I want to join all of the DataFrames from the code above on two columns that are present in every file ("colA" and "colB"). Is it possible to do this in a loop instead of hard-coding it like below?

df_merged = d[1].join(d[2], ["colA", "colB"], "outer").join(d[3], ["colA", "colB"], "outer") \
    .join(d[4], ["colA", "colB"], "outer").join(d[5], ["colA", "colB"], "outer")
You can use Python's functools.reduce on the dictionary's values, provided the join keys are the same in every DataFrame.

from functools import reduce

joined_sdf = reduce(lambda x, y: x.join(y, ['colA', 'colB'], 'outer'), d.values())
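A slightly fuller sketch that combines the reading loop with the reduce-based join, assuming a running SparkSession and a get_csv_files helper that yields file paths as sketched above (the /test path and column names are placeholders from the question):

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read each CSV into a DataFrame, keyed by an integer index as in the question.
d = {}
for i, path in enumerate(get_csv_files("/test"), start=1):
    d[i] = spark.read.option("header", True).csv(path)

# Fold all DataFrames into a single outer join on the shared key columns.
df_merged = reduce(
    lambda left, right: left.join(right, ["colA", "colB"], "outer"),
    d.values(),
)
df_merged.show()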
