Getting DBFS files as a streaming DataFrame in Databricks

Question

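I need to create an external table in Databricks for every CSV file that lands in ADLS Gen2 storage.

The solution I came up with is to get a streaming DataFrame from the output of dbutils.fs.ls() and then call a function inside foreachBatch() that creates the table.

I already have the function ready, but I can't find a way to turn the directory listing into a streaming DataFrame. Does anyone know how to do this?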

Answer 1

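Please check the following code block. It reads a directory as a streaming DataFrame with spark.readStream, which picks up new files as they arrive in that directory.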

package com.sparkbyexamples.spark.streaming
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object SparkStreamingFromDirectory {

  def main(args: Array[String]): Unit = {

    val spark:SparkSession = SparkSession.builder()
      .master("local[3]")
      .appName("SparkByExamples")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

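    // File-based streaming sources need an explicit schema.
    // Add further fields here to match your files.
    val schema = StructType(
      List(
        StructField("Zipcode", IntegerType, true)
      )
    )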

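    // Each new file that appears in the directory becomes part of the stream.
    // Use .csv(...) instead of .json(...) if the landing files are CSV, as in the question.
    val df = spark.readStream
      .schema(schema)
      .json("Your directory")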

    df.printSchema()

    val groupDF = df.select("Zipcode")
        .groupBy("Zipcode").count()
    groupDF.printSchema()

    groupDF.writeStream
      .format("console")
      .outputMode("complete")
      .start()
      .awaitTermination()
  }
}
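
The example above only aggregates and prints to the console. To get closer to what the question asks for (one external table per landed CSV file), a minimal sketch could read the directory as a stream, tag every row with its source file via input_file_name(), and create a table for each new file inside foreachBatch(). This is only an illustration under several assumptions: the object name, the tableNameFor and createTableSql helpers, the table naming rule, the header option, and the checkpoint placeholder are all hypothetical, and it assumes your catalog accepts a single file as a table location. On Databricks, Auto Loader (the cloudFiles source) may be a better fit for discovering new files.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name

object ExternalTablePerCsvFile {

  // Hypothetical naming rule: ".../in/Sales-2022.csv" becomes "sales_2022".
  def tableNameFor(path: String): String = {
    val fileName = path.substring(path.lastIndexOf('/') + 1)
    fileName.stripSuffix(".csv").replaceAll("[^A-Za-z0-9_]", "_").toLowerCase
  }

  // Builds the DDL for one file. The header option is an assumption about the data.
  def createTableSql(path: String): String =
    s"CREATE TABLE IF NOT EXISTS ${tableNameFor(path)} " +
      s"USING CSV OPTIONS (header 'true') LOCATION '$path'"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ExternalTablePerCsvFile")
      .getOrCreate()

    // The text source has a fixed schema, so no user schema is needed here.
    // Only the source file path of each row is kept.
    val files = spark.readStream
      .text("Your directory")
      .select(input_file_name().as("path"))

    // Typed as a Scala function value to avoid overload ambiguity in foreachBatch.
    val createTables: (DataFrame, Long) => Unit = (batch, _) =>
      batch.distinct().collect().foreach { row =>
        batch.sparkSession.sql(createTableSql(row.getString(0)))
      }

    files.writeStream
      .foreachBatch(createTables)
      // Placeholder: the checkpoint records which files were already processed.
      .option("checkpointLocation", "Your checkpoint directory")
      .start()
      .awaitTermination()
  }
}
```

Because the checkpoint tracks processed files, each file should reach foreachBatch() once across restarts, and CREATE TABLE IF NOT EXISTS keeps the call harmless if a batch is retried. Note that this sketch does not escape quotes or special characters in file paths.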

Answer 1: score -1, posted 2022-04-04 11:31:50
