
Spark Python: read multiple CSVs into one dataframe

I have multiple csv files on a datalake. I can connect to the datalake and can even list the files. But I need to put these files into one dataframe, so I can load that dataframe to SQL. Loading to SQL is also no problem. The problem is that only the content of the last file in the datalake folder is read and written to SQL (and thus is also the only content in the dataframe), probably because the dataframe is overwritten on each iteration. I don't know how to append data to the dataframe on each cycle. Here's the code I use:

    for file in dayfolders.collect():
      filename = file.name
      pathname = file.path
      tablename = "Obelix" 
      if filename.endswith(".csv"): 
          df = spark.read.format("csv")\
          .option("inferschema", "true")\
          .option("header","true")\
          .load(file.path)
          continue
      else:
          continue 

If I put the statement print(filename) directly after the for statement, I can see it loops through the three files. Each file is processed just fine on its own.

You can import using a list of files. They'll be automatically combined into a single dataframe for you.

csv_import = sqlContext.read\
  .format('csv')\
  .options(sep = ',', header='true', inferSchema='true')\
  .load([file.path for file in dayfolders.collect()])

csv_import.createOrReplaceTempView("<temporary table name>")
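From there the combined dataframe can be written to SQL in one go. A minimal sketch, assuming a JDBC-reachable SQL database and reusing the table name "Obelix" from the question's code; the URL, user and password are placeholders you would replace with your own connection details:

# Sketch only: the JDBC URL, user and password below are placeholders.
csv_import.write \
  .format("jdbc") \
  .option("url", "jdbc:sqlserver://<server>:1433;databaseName=<database>") \
  .option("dbtable", "Obelix") \
  .option("user", "<user>") \
  .option("password", "<password>") \
  .mode("overwrite") \
  .save()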

If you're set on reading the files in as individual dataframes, then you need to union each one onto the running result:

for ind, file in enumerate(dayfolders.collect()):
  if ind == 0:
    df = spark.read.format("csv")\
      .option("inferschema", "true")\
      .option("header","true")\
      .load(file.path)
  else:
    df = df.union(spark.read.format("csv")\
      .option("inferschema", "true")\
      .option("header","true")\
      .load(file.path))
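If you do go the per-file route, a more compact equivalent (a sketch, assuming the same spark session and dayfolders listing as in the question, and that all files share the same columns) is to collect the dataframes in a list and reduce them with union:

from functools import reduce

# Read every CSV in the folder into its own dataframe ...
dfs = [spark.read.format("csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .load(file.path)
       for file in dayfolders.collect()
       if file.name.endswith(".csv")]

# ... then union them pairwise into a single dataframe.
df = reduce(lambda left, right: left.union(right), dfs)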

I do not recommend you do that. Just use the first method.

You don't have to use the for loop. You can give "dayfolders/*.csv" in load and it will load all the files directly and combine them into one dataframe.

f = spark.read.format("csv")\
          .option("inferschema", "true")\
          .option("header","true")\
          .load("dayfolders/*.csv")
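To verify that the wildcard actually picked up all the files (a sketch, assuming the dataframe f from the snippet above), you can tag each row with its source file via input_file_name and list the distinct paths:

from pyspark.sql.functions import input_file_name

# List the distinct source files that ended up in the dataframe.
f.withColumn("source_file", input_file_name()) \
 .select("source_file").distinct().show(truncate=False)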
