
Construct a dataframe from excel using scala

I am looking for a way to construct a dataframe from an excel file in Spark using Scala. I referred to the SO post below and tried the operation on the attached excel sheet.

[Sample excel sheet]

How to construct Dataframe from a Excel (xls,xlsx) file in Scala Spark?

Unfortunately, the modified code below didn't read all the columns in the excel sheet.

val df = spark.read.format("com.crealytics.spark.excel")
      .option("sheetName", "Sheet1") // Required
      .option("useHeader", "false") // Required
      .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
      .option("inferSchema", "true") // Optional, default: false
      .option("addColorColumns", "false") // Optional, default: false
      .option("startColumn", 0) // Optional, default: 0
      .option("endColumn", 99) // Optional, default: Int.MaxValue
      .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
      .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
      .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
      .option("path", excelFile)
      //.schema(customSchema)
      .load()

+---+---+--------------+---+---+
|_c0|_c1|           _c2|_c3|_c4|
+---+---+--------------+---+---+
|   |   |Test Profile 1|  A|123|
|   |   |Test Profile 2|  B|   |
|   |   |Test Profile 3|  C|   |
|   |   |Test Profile 4|  D|   |
|   |   |Test Profile 5|  E|   |
|   |   |Test Profile 6|  F|   |
+---+---+--------------+---+---+

Am I missing anything here?

My objective is to get all the data from a sheet in which it is randomly distributed, and then get specific values out of it. Some of the cells can be blank.

I can do it in Scala using Apache POI: get the required values, convert them to CSV, and then load that into a dataframe.

However, I am looking for a way to parse the excel sheet directly into a dataframe using Scala, iterate through the dataframe rows, and apply conditions to get the required rows/columns.

P.S. Sorry, I didn't know how to attach an excel file from my local machine.

Thanks!

If you study the source code of crealytics spark-excel, you will find that the number of columns is determined from the first row that contains values. In your excel file, that first row has five columns, so the last column, which has values in other rows but not in that first row, is neglected.

The solution is to define a custom schema and pass it to the framework:

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val customSchema = StructType(Seq(
  StructField("col0", StringType, true),
  StructField("col1", StringType, true),
  StructField("col2", StringType, true),
  StructField("col3", StringType, true),
  StructField("col4", IntegerType, true),
  StructField("col5", IntegerType, true)
))
val df = spark.read.format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1") // Required
  .option("useHeader", "false") // Required
  .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
  .option("inferSchema", "true") // Optional, default: false
  .option("addColorColumns", "false") // Optional, default: false
  .option("startColumn", 0) // Optional, default: 0
  .option("endColumn", 99) // Optional, default: Int.MaxValue
  .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
  .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
  .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
  .option("path", excelFile)
  .schema(customSchema)
  .load()

and you should get the following dataframe:

+----+----+--------------+----+----+----+
|col0|col1|col2          |col3|col4|col5|
+----+----+--------------+----+----+----+
|null|null|Test Profile 1|A   |123 |null|
|null|null|Test Profile 2|B   |null|null|
|null|null|Test Profile 3|C   |null|345 |
|null|null|Test Profile 4|D   |null|null|
|null|null|Test Profile 5|E   |null|null|
|null|null|Test Profile 6|F   |null|null|
+----+----+--------------+----+----+----+
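From here, the row/column conditions the question asks about can be expressed with ordinary DataFrame operations such as `filter`, `select`, and `collect`. The condition logic itself can be sketched without a Spark session, using plain Scala collections to stand in for the parsed rows (the `Profile` case class and the sample values below are hypothetical, mirroring the table above):

```scala
// Hypothetical stand-in for the parsed sheet: one case class per row,
// mirroring the col2/col3/col4 layout of the dataframe shown above.
case class Profile(name: String, code: String, value: Option[Int])

val rows = Seq(
  Profile("Test Profile 1", "A", Some(123)),
  Profile("Test Profile 2", "B", None),
  Profile("Test Profile 3", "C", Some(345))
)

// Apply a condition to pick out the required row, then extract a value;
// on a real DataFrame this corresponds to something like
// df.filter(df("col3") === "A").select("col4").collect().
val valueForA: Option[Int] =
  rows.find(_.code == "A").flatMap(_.value)

println(valueForA) // Some(123)
```

Blank cells map naturally to `None` here, just as they load as `null` in the DataFrame, so the same condition logic skips them safely.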

I hope the answer is helpful.
