
Loading data from an Excel file using Spark Java Excel

I want to load data from an Excel file in HDFS using a Spark 2.2 SparkSession. Below is my Java code and the exception I got.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// session is an existing org.apache.spark.sql.SparkSession
Dataset<Row> df = session.read()
        .format("com.crealytics.spark.excel")
        .option("location", pathFile)
        .option("sheetName", "Feuil1")
        .option("useHeader", "true")
        .option("treatEmptyValuesAsNulls", "true")
        .option("inferSchema", "true")
        .option("addColorColumns", "false")
        .load(pathFile);

I got this exception:

java.lang.NoSuchMethodError: org.apache.poi.ss.usermodel.Workbook.close()V
    at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$getExcerpt(ExcelRelation.scala:81)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:270)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:269)
    at scala.Option.getOrElse(Option.scala:121)
    at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:269)
    at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:97)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)

It looks like a dependency issue. Check whether some of the libraries in your pom/sbt pull in a different version of Apache POI. You can do that, for instance, with mvn dependency:tree ( https://maven.apache.org/plugins/maven-dependency-plugin/examples/resolving-conflicts-using-the-dependency-tree.html ) or the corresponding SBT/Gradle command.
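For example, with Maven you can narrow the tree down to POI artifacts using the maven-dependency-plugin's includes filter:

    # list every path through which Apache POI enters the build
    mvn dependency:tree -Dincludes=org.apache.poi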

When you find the conflicting dependency (the one where the Workbook.close() method is missing), you can exclude it from the import.
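A minimal sketch of such an exclusion in a Maven pom.xml; com.example:legacy-lib is a placeholder for whatever artifact mvn dependency:tree actually reports as bringing in the old POI:

    <dependency>
        <!-- placeholder: the artifact that drags in the old POI -->
        <groupId>com.example</groupId>
        <artifactId>legacy-lib</artifactId>
        <version>1.0</version>
        <exclusions>
            <!-- exclude the transitive POI whose Workbook interface predates close() -->
            <exclusion>
                <groupId>org.apache.poi</groupId>
                <artifactId>poi</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

After the exclusion, you can declare a single recent POI version as a direct dependency so the whole build agrees on it.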

Apparently the close() method was added here: https://github.com/apache/poi/commit/47a8f6cf486b974f31ffd694716f424114e647d5

The problem is that spark-excel only supports Scala. So if you want to use this dependency, you can compile your Scala code into a .jar package and then use that package in your Java project.
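A minimal sketch of what that Scala side could look like, reusing the options from the question (ExcelReader is an illustrative name; package it with sbt package and add the resulting .jar to the Java project's classpath):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Callable from Java as: ExcelReader.read(session, pathFile)
    object ExcelReader {
      def read(session: SparkSession, pathFile: String): DataFrame =
        session.read
          .format("com.crealytics.spark.excel")
          .option("sheetName", "Feuil1")
          .option("useHeader", "true")
          .option("treatEmptyValuesAsNulls", "true")
          .option("inferSchema", "true")
          .load(pathFile)
    }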
