
Read excel files with apache spark

(new to apache spark)

I tried to create a small Scala Spark app which reads excel files and inserts the data into a database, but I have some errors which (I think) are caused by different library versions.

Scala v2.12 
Spark v3.0 
Spark-Excel v0.13.1

Maven configuration is:

    <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.12</artifactId>
                <version>3.0.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql_2.12</artifactId>
                <version>3.0.0</version>
            </dependency>
            <!-- https://mvnrepository.com/artifact/com.crealytics/spark-excel -->
            <dependency>
                <groupId>com.crealytics</groupId>
                <artifactId>spark-excel_2.12</artifactId>
                <version>0.13.1</version>
            </dependency>
            <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-core -->
            <dependency>
                <groupId>com.fasterxml.jackson.core</groupId>
                <artifactId>jackson-core</artifactId>
                <version>2.11.1</version>
            </dependency>
        </dependencies>

Main.scala

        val spark = SparkSession
            .builder
            .appName("SparkApp")
            .master("local[*]")
            .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
            .getOrCreate()
 
        val path = "file_path"
        val excel = spark.read
          .format("com.crealytics.spark.excel")
          .option("useHeader", "true")
          .option("treatEmptyValuesAsNulls", "false")
          .option("inferSchema", "false")
          .option("location", path)
          .option("addColorColumns", "false")
          .load()
    
        println(s"excel count is ${excel.count}")

Error is:

Exception in thread "main" scala.MatchError: Map(treatemptyvaluesasnulls -> false, location -> file_path, useheader -> true, inferschema -> false, addcolorcolumns -> false) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap) 
    at com.crealytics.spark.excel.WorkbookReader$.apply(WorkbookReader.scala:38) 
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:28) 
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18) 
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12) 
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339) 
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279) 
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268) 
    at scala.Option.getOrElse(Option.scala:189) 
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268) 
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203) 
    at main.scala.Main$.main(Main.scala:42) 
    at main.scala.Main.main(Main.scala)

This happens only when I try to read excel files, because I use the spark-excel library. Csv or tsv files work fine.

I think you forgot to specify the excel file in load, like spark.read....load("Worktime.xlsx"). That matches the MatchError: the options map in the exception contains no file path, so the path should be passed to load() itself rather than through a location option (at least with spark-excel 0.13.x).

Sample example:

val df = spark.read
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'My Sheet'!B3:C35") // Optional, default: "A1"
    .option("header", "true") // Required
    .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("inferSchema", "false") // Optional, default: false
    .option("addColorColumns", "true") // Optional, default: false
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
    .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
    .option("workbookPassword", "pass") // Optional, default None. Requires unlimited strength JCE for older JVMs
    .schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
    .load("Worktime.xlsx")

Ref: spark-excel readme

I know that this doesn't directly answer your question, but it may still help you solve your issue.

  1. You can use the pandas package from Python.
  2. Read in the excel file with pandas and Python.
  3. Convert the pandas dataframe to a Spark dataframe.
  4. Save it with pyspark as a parquet/Hive table.
  5. Load the data with Scala & Spark (see the sketch after this list).
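
A minimal sketch of that workflow, assuming pyspark, pandas and openpyxl (pandas' xlsx engine) are installed; the file name and output path below are hypothetical placeholders:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ExcelWorkaround").getOrCreate()

    # Steps 1-2: read the excel file with pandas.
    pdf = pd.read_excel("Worktime.xlsx")  # hypothetical file name

    # Step 3: convert the pandas dataframe to a Spark dataframe.
    # (Mixed-type columns may need explicit casting before this step.)
    sdf = spark.createDataFrame(pdf)

    # Step 4: save it as parquet (use saveAsTable for a Hive table instead).
    sdf.write.mode("overwrite").parquet("/tmp/worktime.parquet")  # hypothetical path

    # Step 5: the parquet output can then be read back from Scala & Spark:
    #   val df = spark.read.parquet("/tmp/worktime.parquet")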
