
Spark Python: read multiple CSVs into one dataframe

I have multiple csv files on a datalake. I can connect to the datalake and can even list the files. But I need to put these files into one dataframe, so I can load that dataframe to SQL. Loading to SQL is also no problem. The problem is that only the content of the last file in the datalake folder is read and written to SQL (and thus is also the only content in the dataframe), probably because the dataframe is overwritten on each iteration. I don't know how to append data to the dataframe on each cycle. Here's the code I use:

    for file in dayfolders.collect():
      filename = file.name
      pathname = file.path
      tablename = "Obelix" 
      if filename.endswith(".csv"): 
          df = spark.read.format("csv")\
          .option("inferschema", "true")\
          .option("header","true")\
          .load(file.path)
          continue
      else:
          continue 

If I put the statement print(filename) directly after the for statement, I can see it loops through the three files. Each file is processed just fine on its own.

You can import using a list of files. They'll be automatically combined into a single dataframe for you.

csv_import = sqlContext.read\
  .format('csv')\
  .options(sep = ',', header='true', inferSchema='true')\
  .load([file.path for file in dayfolders.collect()])

csv_import.createOrReplaceTempView("<temporary table name>")
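From there the combined dataframe can be written to SQL in one go. A minimal sketch, assuming a JDBC-reachable SQL database and reusing the table name "Obelix" from the question's code; the URL, user and password are placeholders you would replace with your own connection details:

# Sketch only: the JDBC URL, user and password below are placeholders.
csv_import.write \
  .format("jdbc") \
  .option("url", "jdbc:sqlserver://<server>:1433;databaseName=<database>") \
  .option("dbtable", "Obelix") \
  .option("user", "<user>") \
  .option("password", "<password>") \
  .mode("overwrite") \
  .save()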

If you're set on reading the files in as individual dataframes, then you need to union each one onto the running result:

for ind, file in enumerate(dayfolders.collect()):
  if ind == 0:
    df = spark.read.format("csv")\
      .option("inferschema", "true")\
      .option("header","true")\
      .load(file.path)
  else:
    df = df.union(spark.read.format("csv")\
      .option("inferschema", "true")\
      .option("header","true")\
      .load(file.path))
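If you do go the per-file route, a more compact equivalent (a sketch, assuming the same spark session and dayfolders listing as in the question, and that all files share the same columns) is to collect the dataframes in a list and reduce them with union:

from functools import reduce

# Read every CSV in the folder into its own dataframe ...
dfs = [spark.read.format("csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .load(file.path)
       for file in dayfolders.collect()
       if file.name.endswith(".csv")]

# ... then union them pairwise into a single dataframe.
df = reduce(lambda left, right: left.union(right), dfs)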

I do not recommend you do that. Just use the first method.

You don't have to use the for loop. You can give "dayfolders/*.csv" in load and it will load all the files directly and combine them into one dataframe.

f = spark.read.format("csv")\
          .option("inferschema", "true")\
          .option("header","true")\
          .load("dayfolders/*.csv")
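To verify that the wildcard actually picked up all the files (a sketch, assuming the dataframe f from the snippet above), you can tag each row with its source file via input_file_name and list the distinct paths:

from pyspark.sql.functions import input_file_name

# List the distinct source files that ended up in the dataframe.
f.withColumn("source_file", input_file_name()) \
 .select("source_file").distinct().show(truncate=False)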
