How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?
Recently I wanted to do the Spark Machine Learning Lab from Spark Summit 2016. The training video is here and the exported notebook is available here.
The dataset used in the lab can be downloaded from the UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is an xlsx file with five sheets.
To use the data in the lab I needed to read all the sheets from the Excel file and to concatenate them into one Spark DataFrame. During the training they used a Databricks Notebook, but I was using IntelliJ IDEA with Scala and evaluating the code in the console.
The first step was to save all the Excel sheets into separate xlsx files named sheet1.xlsx, sheet2.xlsx, etc. and to put them into a sheets directory.
How to read all the Excel files and concatenate them into one Apache Spark DataFrame?
For this I have used the spark-excel package. It can be added to the build.sbt file as:
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
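Alternatively, when experimenting in a plain spark-shell rather than an sbt project, the same artifact can be pulled in with --packages (a sketch; the Maven coordinates are assumed to mirror the sbt line above, with _2.11 for the Scala version):
spark-shell --packages com.crealytics:spark-excel_2.11:0.8.2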
The code to execute in the IntelliJ IDEA Scala Console was:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File
val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val spark = SparkSession.builder().getOrCreate()
// Function to read an xlsx file using spark-excel.
// This code format with "trailing dots" can be sent to the IJ Scala Console as a block.
def readExcel(file: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "true").
  option("inferSchema", "true").
  option("addColorColumns", "false").
  load()
val dir = new File("./data/CCPP/sheets")
val excelFiles = dir.listFiles.sorted.map(f => f.toString) // Array[String]
val dfs = excelFiles.map(f => readExcel(f)) // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_)) // DataFrame
ppdf.count() // res3: Long = 47840
ppdf.show(5)
Console output:
+-----+-----+-------+-----+------+
| AT| V| AP| RH| PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
+-----+-----+-------+-----+------+
only showing top 5 rows
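One thing worth checking before the reduce: spark-excel infers each sheet's schema independently, and DataFrame union matches columns by position, not by name. A minimal guard (a sketch, reusing the dfs array built above):
// All sheets should have produced the same schema before we union them.
val schemas = dfs.map(_.schema).distinct
require(schemas.size == 1, s"Sheets have diverging schemas: $schemas")
val ppdf = dfs.reduce(_.union(_))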
Hope this Spark Scala code helps.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI

// List the leaf files under basep that match the glob pattern globp,
// using Spark's internal InMemoryFileIndex (works on Databricks / Spark 2.x).
def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  // Make sure each path fragment starts with a slash before merging.
  def validated(path: String): Path = {
    if (path startsWith "/") new Path(path)
    else new Path("/" + path)
  }

  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)

  fileCatalog.flatMap(_._2.map(_.path))
}
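As an aside, InMemoryFileIndex.bulkListLeafFiles and SparkHadoopUtil are Spark-internal APIs whose signatures have changed between releases. A more version-stable variant that sticks to the public Hadoop FileSystem API (a sketch, assuming the same base-path and glob arguments) could look like:
import java.net.URI
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Glob for regular files under basep; globStatus is stable public Hadoop API.
def listFilesPublic(basep: String, globp: String): Seq[String] = {
  val fs = FileSystem.get(new URI(basep), sc.hadoopConfiguration)
  Option(fs.globStatus(new Path(basep, globp)))
    .getOrElse(Array.empty[FileStatus])
    .filter(_.isFile)
    .map(_.getPath.toString)
    .toSeq
}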
val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"
val files = listFiles(root, globp)
val paths = files.toVector
Loop over the vector to read the files one by one:
for (path <- paths) {
  println(path.toString)
  val df = spark.read.
    format("com.crealytics.spark.excel").
    option("useHeader", "true").
    option("treatEmptyValuesAsNulls", "false").
    option("inferSchema", "false").
    option("addColorColumns", "false").
    load(path.toString)
}
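Note that the loop above only reads each file and then discards the df, so nothing is concatenated yet. One way to end up with a single DataFrame (a sketch, reusing the paths vector from above) is to map over the paths and reduce with union, as in the first answer:
val combined = paths.
  map(p => spark.read.
    format("com.crealytics.spark.excel").
    option("useHeader", "true").
    option("treatEmptyValuesAsNulls", "false").
    option("inferSchema", "false").
    option("addColorColumns", "false").
    load(p)).
  reduce(_.union(_))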
We need the spark-excel library for this; it can be obtained from https://github.com/crealytics/spark-excel#scala-api
spark-shell --driver-class-path ./spark-excel_2.11-0.8.3.jar --master=yarn-client
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new SQLContext(sc)
val document = "path to excel doc"
val dataDF = sqlContext.read.
  format("com.crealytics.spark.excel").
  option("sheetName", "Sheet Name").
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "false").
  option("inferSchema", "false").
  option("location", document).
  option("addColorColumns", "false").
  load(document)
That's all! Now you can perform DataFrame operations on the dataDF object.
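If the five sheets live in one workbook, a variant of the same call (a sketch; the sheet names "Sheet1" through "Sheet5" are assumptions, so check your workbook) reads each sheet by name and unions the results:
val sheetNames = (1 to 5).map(i => s"Sheet$i")
val allSheets = sheetNames.
  map(name => sqlContext.read.
    format("com.crealytics.spark.excel").
    option("sheetName", name).
    option("useHeader", "true").
    option("location", document).
    load(document)).
  reduce(_.union(_))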