
How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?

Recently I wanted to work through the Spark Machine Learning Lab from Spark Summit 2016. The training video is here and the exported notebook is available here.

The dataset used in the lab can be downloaded from the UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is an xlsx file with five sheets.

To use the data in the lab I needed to read all the sheets from the Excel file and concatenate them into one Spark DataFrame. During the training they use a Databricks Notebook, but I was using IntelliJ IDEA with Scala and evaluating the code in the console.

The first step was to save each Excel sheet into a separate xlsx file named sheet1.xlsx, sheet2.xlsx, etc., and put them into a sheets directory.

How to read all the Excel files and concatenate them into one Apache Spark DataFrame?

For this I used the spark-excel package. It can be added to the build.sbt file as: libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
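For reference, a minimal build.sbt sketch showing where that dependency line goes (the project name, Scala version, and Spark version are assumptions, not from the lab):

```scala
// build.sbt -- sketch; the name and versions here are assumptions
name := "excel-to-dataframe"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided",
  "com.crealytics"   %% "spark-excel" % "0.8.2"
)
```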

The code to execute in the IntelliJ IDEA Scala Console was:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File

val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")

val spark = SparkSession.builder().getOrCreate()

// Function to read xlsx file using spark-excel. 
// This code format with "trailing dots" can be sent to IJ Scala Console as a block.
def readExcel(file: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "true").
  option("inferSchema", "true").
  option("addColorColumns", "false").
  load()

val dir = new File("./data/CCPP/sheets")
val excelFiles = dir.listFiles.sorted.map(f => f.toString)  // Array[String]

val dfs = excelFiles.map(f => readExcel(f))  // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_))  // DataFrame 

ppdf.count()  // res3: Long = 47840
ppdf.show(5)

Console output:

+-----+-----+-------+-----+------+
|   AT|    V|     AP|   RH|    PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
+-----+-----+-------+-----+------+
only showing top 5 rows 
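A note on dfs.reduce(_.union(_)): DataFrame union resolves columns by position, not by name, so this works because all five sheets share the same column order (AT, V, AP, RH, PE). The reduce itself is ordinary Scala; with plain collections the same fold looks like this (the row values below are just the sample rows from the output above):

```scala
// Each Seq stands in for one sheet's rows; reduce(_ ++ _) mirrors
// dfs.reduce(_.union(_)) on DataFrames (union matches columns by position).
val sheet1 = Seq(("14.96", "463.26"), ("25.18", "444.37"))
val sheet2 = Seq(("5.11", "488.56"))

val sheets = Seq(sheet1, sheet2)
val all = sheets.reduce(_ ++ _)   // concatenates in order: sheet1 then sheet2

println(all.size)  // 3
```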

Hope this Spark Scala code helps.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI

def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  def validated(path: String): Path = {
    if(path startsWith "/") new Path(path)
    else new Path("/" + path)
  }

  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)

  fileCatalog.flatMap(_._2.map(_.path))
}

val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"

val files = listFiles(root, globp)
val paths = files.toVector
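The glob "[^_]*" keeps only names that do not start with an underscore, which skips marker files such as _SUCCESS that Spark writes alongside data. As a plain-Scala illustration of what that pattern admits (the file names here are hypothetical):

```scala
// Hypothetical listing of a data directory; _SUCCESS and _committed_*
// are typical Spark/Databricks marker files we want to exclude.
val names = Seq("part1.xlsx", "_SUCCESS", "part2.xlsx", "_committed_123")

// The glob "[^_]*" keeps names whose first character is not '_';
// the equivalent plain-Scala filter:
val kept = names.filterNot(_.startsWith("_"))

println(kept)  // List(part1.xlsx, part2.xlsx)
```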

Loop over the vector to read multiple files:

// Read each file and collect the DataFrames so they can be combined afterwards
// (in the original loop the per-iteration df went out of scope unused)
val dfs = for (path <- paths) yield {
  println(path)
  spark.read.
    format("com.crealytics.spark.excel").
    option("useHeader", "true").
    option("treatEmptyValuesAsNulls", "false").
    option("inferSchema", "false").
    option("addColorColumns", "false").
    load(path)
}
val combined = dfs.reduce(_ union _)

We need the spark-excel library for this; it can be obtained from https://github.com/crealytics/spark-excel#scala-api

  1. Clone the git project from the github link above and build it using "sbt package".
  2. Using Spark 2, run the spark-shell:

spark-shell --driver-class-path ./spark-excel_2.11-0.8.3.jar --master=yarn-client

  3. Import the necessary packages:

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new SQLContext(sc)

  4. Set the excel doc path:

val document = "path to excel doc"

  5. Execute the function below to create a DataFrame from it:

val dataDF = sqlContext.read.
  format("com.crealytics.spark.excel").
  option("sheetName", "Sheet Name").
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "false").
  option("inferSchema", "false").
  option("addColorColumns", "false").
  load(document)

That's all! Now you can perform DataFrame operations on the dataDF object.
