How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?
Recently I wanted to do the Spark Machine Learning Lab from Spark Summit 2016. The training video is here and the exported notebook is available here.
The dataset used in the lab can be downloaded from the UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is an xlsx file with five sheets.
To use the data in the lab I needed to read all the sheets from the Excel file and to concatenate them into one Spark DataFrame. During the training they used a Databricks Notebook, but I was using IntelliJ IDEA with Scala and evaluating the code in the console.
The first step was to save all the Excel sheets into separate xlsx files named sheet1.xlsx, sheet2.xlsx, etc. and to put them into a sheets directory.
How to read all the Excel files and concatenate them into one Apache Spark DataFrame?
For this I have used the spark-excel package. It can be added to the build.sbt file as:
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
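Alternatively, when experimenting in a plain spark-shell rather than an sbt project, the same artifact can be pulled in with --packages (a sketch; the Maven coordinates are assumed to mirror the sbt line above, with _2.11 for the Scala version):
spark-shell --packages com.crealytics:spark-excel_2.11:0.8.2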
The code to execute in the IntelliJ IDEA Scala Console was:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File
val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val spark = SparkSession.builder().getOrCreate()
// Function to read an xlsx file using spark-excel.
// This code format with "trailing dots" can be sent to the IJ Scala Console as a block.
def readExcel(file: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "true").
  option("inferSchema", "true").
  option("addColorColumns", "false").
  load()
val dir = new File("./data/CCPP/sheets")
val excelFiles = dir.listFiles.sorted.map(f => f.toString) // Array[String]
val dfs = excelFiles.map(f => readExcel(f)) // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_)) // DataFrame
ppdf.count() // res3: Long = 47840
ppdf.show(5)
Console output:
+-----+-----+-------+-----+------+
| AT| V| AP| RH| PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
+-----+-----+-------+-----+------+
only showing top 5 rows
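One thing worth checking before the reduce: spark-excel infers each sheet's schema independently, and DataFrame union matches columns by position, not by name. A minimal guard (a sketch, reusing the dfs array built above):
// All sheets should have produced the same schema before we union them.
val schemas = dfs.map(_.schema).distinct
require(schemas.size == 1, s"Sheets have diverging schemas: $schemas")
val ppdf = dfs.reduce(_.union(_))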
Hope this Spark Scala code helps.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI

// List the leaf files under basep that match the glob pattern globp,
// using Spark's internal InMemoryFileIndex (works on Databricks / Spark 2.x).
def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  // Make sure each path fragment starts with a slash before merging.
  def validated(path: String): Path = {
    if (path startsWith "/") new Path(path)
    else new Path("/" + path)
  }

  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)

  fileCatalog.flatMap(_._2.map(_.path))
}
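As an aside, InMemoryFileIndex.bulkListLeafFiles and SparkHadoopUtil are Spark-internal APIs whose signatures have changed between releases. A more version-stable variant that sticks to the public Hadoop FileSystem API (a sketch, assuming the same base-path and glob arguments) could look like:
import java.net.URI
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Glob for regular files under basep; globStatus is stable public Hadoop API.
def listFilesPublic(basep: String, globp: String): Seq[String] = {
  val fs = FileSystem.get(new URI(basep), sc.hadoopConfiguration)
  Option(fs.globStatus(new Path(basep, globp)))
    .getOrElse(Array.empty[FileStatus])
    .filter(_.isFile)
    .map(_.getPath.toString)
    .toSeq
}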
val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"
val files = listFiles(root, globp)
val paths = files.toVector
Loop over the vector to read the files one by one:
for (path <- paths) {
  println(path.toString)
  val df = spark.read.
    format("com.crealytics.spark.excel").
    option("useHeader", "true").
    option("treatEmptyValuesAsNulls", "false").
    option("inferSchema", "false").
    option("addColorColumns", "false").
    load(path.toString)
}
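Note that the loop above only reads each file and then discards the df, so nothing is concatenated yet. One way to end up with a single DataFrame (a sketch, reusing the paths vector from above) is to map over the paths and reduce with union, as in the first answer:
val combined = paths.
  map(p => spark.read.
    format("com.crealytics.spark.excel").
    option("useHeader", "true").
    option("treatEmptyValuesAsNulls", "false").
    option("inferSchema", "false").
    option("addColorColumns", "false").
    load(p)).
  reduce(_.union(_))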
We need the spark-excel library for this; it can be obtained from https://github.com/crealytics/spark-excel#scala-api
spark-shell --driver-class-path ./spark-excel_2.11-0.8.3.jar --master=yarn-client
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new SQLContext(sc)
val document = "path to excel doc"
val dataDF = sqlContext.read.
  format("com.crealytics.spark.excel").
  option("sheetName", "Sheet Name").
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "false").
  option("inferSchema", "false").
  option("location", document).
  option("addColorColumns", "false").
  load(document)
That's all! Now you can perform DataFrame operations on the dataDF object.
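If the five sheets live in one workbook, a variant of the same call (a sketch; the sheet names "Sheet1" through "Sheet5" are assumptions, so check your workbook) reads each sheet by name and unions the results:
val sheetNames = (1 to 5).map(i => s"Sheet$i")
val allSheets = sheetNames.
  map(name => sqlContext.read.
    format("com.crealytics.spark.excel").
    option("sheetName", name).
    option("useHeader", "true").
    option("location", document).
    load(document)).
  reduce(_.union(_))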