spark python read multiple csv's to dataframe

I have multiple csv files on a datalake. I can connect to the datalake and can even list the files. But I need to put these files in one dataframe, so I can load this dataframe to SQL. Loading to SQL is also no problem. The problem is that only the content of the last file in the datalake folder is read and written to SQL (and thus also ends up in the dataframe), probably because the dataframe is overwritten on each iteration. But I don't know how to append data to the dataframe on each cycle. Here's the code I use:

    for file in dayfolders.collect():
        filename = file.name
        pathname = file.path
        tablename = "Obelix"
        if filename.endswith(".csv"):
            df = spark.read.format("csv")\
                .option("inferschema", "true")\
                .option("header", "true")\
                .load(file.path)
            continue
        else:
            continue

If I put the statement print(filename) directly after the for statement, I can see that it loops through the three files. Each file on its own is processed just fine.

You can import using a list of files. They'll be automatically combined together for you.

csv_import = sqlContext.read\
  .format('csv')\
  .options(sep=',', header='true', inferSchema='true')\
  .load([file.path for file in dayfolders.collect()])

csv_import.createOrReplaceTempView(<temporary table name>)
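
Once the files are combined, the same dataframe can be pushed to SQL with Spark's JDBC writer. A minimal sketch, assuming a SQL Server target; the url, user, and password are placeholders to fill in, and the table name "Obelix" is taken from the question:

# assumes the SQL Server JDBC driver is on the Spark classpath
csv_import.write\
  .format('jdbc')\
  .option('url', 'jdbc:sqlserver://<host>:1433;database=<database>')\
  .option('dbtable', 'Obelix')\
  .option('user', '<user>')\
  .option('password', '<password>')\
  .mode('append')\
  .save()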

If you're set on reading the files in as individual dataframes, then you need to union each dataframe together:

for ind, file in enumerate(dayfolders.collect()):
  if ind == 0:
    # The first file starts the dataframe
    df = spark.read.format("csv")\
      .option("inferschema", "true")\
      .option("header", "true")\
      .load(file.path)
  else:
    # Every following file is appended to it
    df = df.union(spark.read.format("csv")\
      .option("inferschema", "true")\
      .option("header", "true")\
      .load(file.path))
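
One caveat if you union manually: union matches columns by position, not by name. If the csv files might list their columns in a different order, unionByName (available since Spark 2.3) is the safer variant. A sketch of the same loop under that assumption, with every file sharing the same column names:

df = None
for file in dayfolders.collect():
    part = spark.read.format("csv")\
        .option("inferschema", "true")\
        .option("header", "true")\
        .load(file.path)
    # unionByName aligns columns by header name rather than position
    df = part if df is None else df.unionByName(part)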

I do not recommend you do that. Just use the first method.

You don't have to use the for loop. You can pass "dayfolders/*.csv" to load and it will read all the files directly and combine them into a single dataframe.

f = spark.read.format("csv")\
      .option("inferschema", "true")\
      .option("header", "true")\
      .load("dayfolders/*.csv")
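
If you need to know afterwards which file each row came from, Spark's input_file_name function can tag the rows during the same wildcard load. A small sketch, reusing the "dayfolders/*.csv" pattern from above:

from pyspark.sql.functions import input_file_name

df = spark.read.format("csv")\
      .option("inferschema", "true")\
      .option("header", "true")\
      .load("dayfolders/*.csv")\
      .withColumn("source_file", input_file_name())  # full path of the originating csv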
