
How to process millions of small JSON files quickly using Scala Spark?

I have to process millions of JSON files from Azure Blob Storage, each representing one row, and need to load them into Azure SQL DB with some minimal transformation in between. These files come in at random times but all follow the same schema.

My first solution basically just created a DataFrame for each file and pushed it into SQL. This worked when we were receiving hundreds of files, but now that we are receiving millions of files it does not scale, taking over a day to process.

We also tried processing the files in Scala without Spark (see code below), but this is also too slow: 500 files processed in 8 minutes.

// mapper, connection, statement, allFiles and srcDestMap are defined elsewhere:
// mapper is a Jackson-style ObjectMapper, srcDestMap maps source JSON fields to destination columns.
var sql_statement = ""
allFiles.par.map(file_name => {
  // processing: read the file, parse the JSON, and translate source fields to destination columns
  val json = scala.io.Source.fromFile(file_name).mkString
  val mapData1 = mapper.readValue(json, classOf[Map[String, Any]])
  val account = mapData1("Contact").asInstanceOf[Map[String, Any]]
  val common = account.keys.toList.intersect(srcDestMap.keys.toList)
  val trMap = common.map(rec => Map(srcDestMap(rec) -> account(rec))).flatten.toMap
  val vals = trMap.keys.toList.sorted.map(trMap(_).toString.replace("'", "''")).map("'" + _ + "'")
  // end processing

  // build one INSERT per file and append it to the shared statement string
  val cols = "insert into dbo.Contact_VS(" + trMap.keys.toList.sorted.mkString(",") + ")" + " values (" + vals.mkString(",") + ")"
  sql_statement = sql_statement + cols
})
val updated = statement.executeUpdate(sql_statement)
connection.close()

If anyone knows how to optimize this code, or has any out-of-the-box ideas we could use to preprocess our JSON, it would be greatly appreciated! The JSON is nested, so it's a little more involved to merge everything into one large JSON file to be read into Spark, but we may have to go that way if no one has any better ideas; a rough sketch of that read is below.
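For reference, the bulk read we are considering would look roughly like this; the inner schema fields and the storage path are placeholders, and the same call also accepts a directory of small files rather than one merged file:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("ContactBulkLoad").getOrCreate()

// Supplying the schema up front avoids a schema-inference pass over the whole input.
// "Contact" matches our nesting; the inner fields here are placeholders.
val contactSchema = new StructType()
  .add("Contact", new StructType()
    .add("FirstName", StringType)
    .add("LastName", StringType))

val df = spark.read
  .schema(contactSchema)
  .json("wasbs://<container>@<account>.blob.core.windows.net/contacts/")  // placeholder path

df.select("Contact.*").show(5)  // minimal transformation would go here before the write to SQL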

You are close; Spark contains some helper functions to parallelize tasks across the cluster. Note that you will want to set "spark.default.parallelism" to a sane number so that you're not creating too many connections to your DB.

  // Fill in with: read the file, apply the minimal transformation, and insert the row into the DB
  def loadFileAndUploadToRDS(filepath: String): Unit = ???

  @Test
  def parallelUpload(): Unit = {
    val files = List("s3://bucket/path" /** more files **/)
    // Distribute the file paths across the cluster; each executor uploads its share
    spark.sparkContext.parallelize(files).foreach(filepath => loadFileAndUploadToRDS(filepath))
  }
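A minimal sketch of what loadFileAndUploadToRDS could look like, assuming the files are readable from the executors and the rows go in over plain JDBC; the connection string, table columns, and JSON field names below are placeholders:

import java.sql.DriverManager
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

def loadFileAndUploadToRDS(filepath: String): Unit = {
  // Parse the single-row JSON file (Jackson with the Scala module is assumed here).
  val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  val json = scala.io.Source.fromFile(filepath).mkString
  val contact = mapper.readValue(json, classOf[Map[String, Any]])("Contact")
    .asInstanceOf[Map[String, Any]]

  // Placeholder connection string and columns; a PreparedStatement handles escaping.
  val conn = DriverManager.getConnection(
    "jdbc:sqlserver://<server>;database=<db>", "<user>", "<password>")
  try {
    val ps = conn.prepareStatement(
      "insert into dbo.Contact_VS(FirstName, LastName) values (?, ?)")
    ps.setString(1, contact.get("FirstName").map(_.toString).orNull)
    ps.setString(2, contact.get("LastName").map(_.toString).orNull)
    ps.executeUpdate()
  } finally conn.close()
}

Opening one connection per file is still wasteful; foreachPartition with one connection per partition and batched inserts is the usual refinement, and it keeps the connection count bounded as noted above.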

Since you already got an answer, let me point out some problems with the raw Scala implementation (a sketch that addresses them follows the list):

1) Creating SQL requests manually is error-prone and inefficient.

2) Updating sql_statement in a loop is very inefficient (each concatenation copies the entire string built so far).

3) The level of parallelism of allFiles.par: .par shouldn't be used for blocking tasks for two reasons:

  • it uses the global shared thread pool under the hood, so one batch of tasks will block other tasks.

  • the parallelism level is optimized for CPU-bound tasks (number of CPU threads). You want much higher parallelism.
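A minimal sketch of how these points could be addressed in plain Scala, reusing mapper, connection, and allFiles from the question; the pool size, column names, and field extraction are illustrative assumptions:

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Point 3: give the parallel collection its own, larger pool instead of the global CPU-sized one.
// (On Scala 2.11 the pool type is scala.concurrent.forkjoin.ForkJoinPool instead.)
val parFiles = allFiles.par
parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(64))

// Points 1 and 2: one PreparedStatement with batching instead of concatenating SQL strings.
val ps = connection.prepareStatement(
  "insert into dbo.Contact_VS(FirstName, LastName) values (?, ?)")  // placeholder columns

parFiles.foreach { fileName =>
  val json = scala.io.Source.fromFile(fileName).mkString
  val contact = mapper.readValue(json, classOf[Map[String, Any]])("Contact")
    .asInstanceOf[Map[String, Any]]
  // A single PreparedStatement is not thread-safe, so serialize the appends.
  ps.synchronized {
    ps.setString(1, contact.get("FirstName").map(_.toString).orNull)
    ps.setString(2, contact.get("LastName").map(_.toString).orNull)
    ps.addBatch()
  }
}
ps.executeBatch()
connection.close()

The parsing and file IO still run in parallel on the bigger pool, while batching avoids building one gigantic SQL string and lets the JDBC driver send the inserts efficiently.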
