
Reading multiple CSV files in Spark and making a DataFrame

I am using the following code to read multiple CSV files, convert each one to a pandas DataFrame, and concatenate them into a single pandas DataFrame, which I finally convert back into a Spark DataFrame. I want to skip the pandas conversion and simply end up with a Spark DataFrame.

File Paths

 abfss://xxxxxx/abc/year=2021/month=1/dayofmonth=1/hour=1/*.csv
 abfss://xxxxxx/abc/year=2021/month=1/dayofmonth=1/hour=2/*.csv
......

Code

import pandas as pd
from pyspark.sql.utils import AnalysisException

pandas_dfs = []

for month in range(1, 3):
    for day in range(1, 31):
        for hour in range(0, 24):
            file_location = "//xxxxxx/abc/year=2021/month=" + str(month) + "/dayofmonth=" + str(day) + "/hour=" + str(hour) + "/*.csv"
            try:
                spark_df = spark.read.format("csv").option("header", "true").load(file_location)
                pandas_df = spark_df.toPandas()
                pandas_dfs.append(pandas_df)
            except AnalysisException as e:
                # Skip hours whose directory does not exist
                print(e)

final_pandas_df = pd.concat(pandas_dfs)
df = spark.createDataFrame(final_pandas_df)

You can load all the files and apply a filter on the partitioning columns:

df = spark.read.format("csv").option("header", "true").load("abfss://xxxxxx/abc/").filter(
    'year = 2021 and month between 1 and 2 and dayofmonth between 1 and 30 and hour between 0 and 23'
)
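Because the directories follow the year=/month=/dayofmonth=/hour= naming pattern, Spark's partition discovery exposes those names as columns, so the filter above only reads the matching directories. If you would rather enumerate the paths yourself, DataFrameReader.load also accepts a list of paths. Below is a minimal sketch under the assumption that every hourly directory in the range exists; a missing path raises AnalysisException, which the filter approach above avoids:

# Build the explicit list of hourly glob paths and load them in one call.
# The basePath option keeps the year/month/dayofmonth/hour partition
# columns available on the resulting DataFrame.
paths = [
    f"abfss://xxxxxx/abc/year=2021/month={m}/dayofmonth={d}/hour={h}/*.csv"
    for m in range(1, 3)
    for d in range(1, 31)
    for h in range(0, 24)
]

df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("basePath", "abfss://xxxxxx/abc/")
    .load(paths)
)

Either way, the data never leaves Spark, so the pandas round-trip (and its driver-side memory cost) is avoided entirely.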
