How to read a whole directory of XLSX files with Apache Spark and Scala?
I have to read a whole directory of xlsx files, and I need to load the entire directory with Apache Spark using Scala.
Currently I'm using this dependency: "com.crealytics" %% "spark-excel" % "0.12.3", and I don't know how to load them all.
There doesn't seem to be a shortcut that can be passed into the path through the option method, so I created the workaround below (assuming each Excel file has the same number of columns). I wrote a method that collects the paths of every file in the source directory, then looped over those paths, reading each file into a new DataFrame and appending it to the previous one.
import java.io.File
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

def getListOfFiles(dir: String): List[File] = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles().filter(_.isFile).toList
  } else {
    List[File]()
  }
}

val path = "\\directory path"
// shows list of files with fully qualified paths
println(getListOfFiles(path))

val schema = StructType(
  StructField("id", IntegerType, true) ::
  StructField("name", StringType, false) ::
  StructField("age", IntegerType, false) :: Nil)

// Create an empty DataFrame with the same schema as each Excel file
var data = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

for (filePath <- getListOfFiles(path)) {
  val tempDF = spark.read.format("com.crealytics.spark.excel")
    .option("useHeader", "true")
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "true")
    .option("addColorColumns", "false")
    .load(filePath.getPath)
  data = data.union(tempDF)
}
data.show()
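The loop-and-union pattern above can also be written without the mutable accumulator: map each file path to a DataFrame and fold the results together with union. This is a minimal sketch under the same assumptions (all workbooks share one schema, spark-excel accepts the file path via load); the directory path and session setup are placeholders, not values from the question.

import java.io.File
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical session setup; adjust for your environment.
val spark = SparkSession.builder().appName("read-xlsx-dir").getOrCreate()

// List only .xlsx files so stray files in the directory do not break the load.
def listXlsx(dir: String): List[File] = {
  val d = new File(dir)
  if (d.exists && d.isDirectory)
    d.listFiles().filter(f => f.isFile && f.getName.endsWith(".xlsx")).toList
  else Nil
}

// Read each workbook into a DataFrame, then union them all.
// reduceOption returns None instead of throwing when the directory is empty.
val combined: Option[DataFrame] =
  listXlsx("/path/to/xlsx/dir")          // placeholder path
    .map { f =>
      spark.read.format("com.crealytics.spark.excel")
        .option("useHeader", "true")
        .option("inferSchema", "true")
        .load(f.getPath)
    }
    .reduceOption(_ union _)

combined.foreach(_.show())

Because the schema is inferred from the first file read, this variant also drops the hand-written empty-DataFrame seed; if your files may disagree on column order, keep an explicit schema instead of inferSchema.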