
How can I read multiple Parquet files in Spark Scala?

Below are some folders, which might keep updating with time. They contain multiple .parquet files. How can I read them into a Spark DataFrame in Scala?

  • "id=200393/date=2019-03-25" “id=200393/日期=2019-03-25”
  • "id=200393/date=2019-03-26" “id=200393/日期=2019-03-26”
  • "id=200393/date=2019-03-27" “id=200393/日期=2019-03-27”
  • "id=200393/date=2019-03-28" “id=200393/日期=2019-03-28”
  • "id=200393/date=2019-03-29" and so on... “id=200393/date=2019-03-29”等等……

Note: there could be 100 date folders, and I need to pick only specific ones (say, the 25th, 26th, and 28th).

Is there any better way than the one below?

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._

val spark = SparkSession.builder.appName("ScalaCodeTest").master("yarn").getOrCreate()
val parquetFiles = List("id=200393/date=2019-03-25", "id=200393/date=2019-03-26", "id=200393/date=2019-03-28")

spark.read.format("parquet").load(parquetFiles: _*)

The above code works, but I want to do something like below:

import scala.collection.mutable.ListBuffer

// List is immutable in Scala, so indexed assignment does not compile;
// a ListBuffer lets you append the paths one at a time.
val parquetFiles = ListBuffer[String]()
parquetFiles += "id=200393/date=2019-03-25"
parquetFiles += "id=200393/date=2019-03-26"
parquetFiles += "id=200393/date=2019-03-28"
spark.read.format("parquet").load(parquetFiles: _*)

You can read all the folders in the directory id=200393 this way:

val df  = spark.read.parquet("id=200393/*")

If you want to select only some dates, for example only September 2019:

val df  = spark.read.parquet("id=200393/2019-09-*")

If you only need some specific days, you can put them in a list:

  val days = List("2019-09-02", "2019-09-03")
  val paths = days.map(day => s"id=200393/date=$day")
  val df = spark.read.parquet(paths: _*)
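
If the days of interest form a contiguous range rather than a hand-picked set, the list can be generated instead of typed out. A minimal sketch using java.time (the range bounds here are made up for illustration):

import java.time.LocalDate

// Build every date from 2019-09-02 to 2019-09-05 inclusive (assumed range).
val start = LocalDate.parse("2019-09-02")
val end = LocalDate.parse("2019-09-05")
val days = Iterator
  .iterate(start)(_.plusDays(1))
  .takeWhile(!_.isAfter(end))
  .map(_.toString)
  .toList

val paths = days.map(day => s"id=200393/date=$day")
val df = spark.read.parquet(paths: _*)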

If you want to keep the partition values as columns in the DataFrame (the plain reads above drop them), you can set basePath so Spark discovers them from the directory names. Partition columns are only discovered below basePath, so the read below keeps date; to also keep id, basePath would have to point at the directory containing id=200393:

val df = spark
     .read
     .option("basePath", "id=200393/")  // partition discovery starts below basePath
     .parquet("id=200393/date=*")
