
How to incorporate spark scala map field to BQ?

I am writing Spark Scala code to write the output to BQ. The following code forms the output table, which has two columns (id and keywords):

import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq("tamil", "telugu", "hindi").toDF("language")

val df2 = Seq(
  (101, Seq("tamildiary", "tamilkeyboard", "telugumovie")),
  (102, Seq("tamilmovie")),
  (103, Seq("hindirhymes", "hindimovie"))
).toDF("id", "keywords")

val pattern = concat(lit("^"), df1("language"), lit(".*"))

import org.apache.spark.sql.Row

val arrayToMap = udf{ (arr: Seq[Row]) =>
  arr.map{ case Row(k: String, v: Int) => (k, v) }.toMap
}

val final_df = df2.
  withColumn("keyword", explode($"keywords")).as("df2").
  join(df1.as("df1"), regexp_replace($"df2.keyword", pattern, lit("")) =!= $"df2.keyword").
  groupBy("id", "language").agg(size(collect_list($"language")).as("count")).
  groupBy("id").agg(arrayToMap(collect_list(struct($"language", $"count"))).as("keywords"))

The output of final_df is:

+---+--------------------+                                                      
| id|            keywords|
+---+--------------------+
|101|Map(tamil -> 2, t...|
|103|     Map(hindi -> 2)|
|102|     Map(tamil -> 1)|
+---+--------------------+

I am defining the function below to pass the schema for this output table. (Since BQ doesn't support a map field, I am using an array of structs, but this is also not working.)

  def createTableIfNotExists(outputTable: String) = {

    spark.createBigQueryTable(
      s"""
         |CREATE TABLE IF NOT EXISTS $outputTable(
         |ds date,
         |id int64,
         |keywords ARRAY<STRUCT<key STRING, value INT64>>
         |)
         |PARTITION BY ds
         |CLUSTER BY user_id
       """.stripMargin)
    
  }

Could anyone please help me write a correct schema for this so that it is compatible with BQ?

You can collect an array of structs as below:

val final_df = df2
    .withColumn("keyword", explode($"keywords")).as("df2")
    .join(df1.as("df1"), regexp_replace($"df2.keyword", pattern, lit("")) =!= $"df2.keyword")
    .groupBy("id", "language")
    .agg(size(collect_list($"language")).as("count"))
    .groupBy("id")
    .agg(collect_list(struct($"language", $"count")).as("app_language"))

final_df.show(false)
+---+-------------------------+
|id |app_language             |
+---+-------------------------+
|101|[[tamil, 2], [telugu, 1]]|
|103|[[hindi, 2]]             |
|102|[[tamil, 1]]             |
+---+-------------------------+

final_df.printSchema
root
 |-- id: integer (nullable = false)
 |-- app_language: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- language: string (nullable = true)
 |    |    |-- count: integer (nullable = false)
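
As an aside, if you would rather keep the map-typed column from the first aggregation in the question, a map can also be converted into an array of structs just before writing. This is only a rough sketch, assuming Spark 3.0+ (where map_entries is available in the Scala functions API); mapDf is a hypothetical stand-in for that map-typed DataFrame:

import org.apache.spark.sql.functions.map_entries

// Hypothetical DataFrame shaped like the question's map-typed final_df
val mapDf = Seq(
  (101, Map("tamil" -> 2, "telugu" -> 1)),
  (102, Map("tamil" -> 1))
).toDF("id", "keywords")

// map_entries turns MAP<STRING, INT> into ARRAY<STRUCT<key, value>>, which lines
// up with the question's keywords ARRAY<STRUCT<key STRING, value INT64>> column
val bqReady = mapDf.withColumn("keywords", map_entries($"keywords"))

Either way, the column you end up writing to BQ is an array of structs rather than a map.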

And then you can have a schema like:

def createTableIfNotExists(outputTable: String) = {

  // The array-of-struct column replaces the unsupported MAP type; its field names
  // match the struct produced by collect_list(struct($"language", $"count")).
  spark.createBigQueryTable(
    s"""
       |CREATE TABLE IF NOT EXISTS $outputTable(
       |ds date,
       |id int64,
       |keywords ARRAY<STRUCT<language STRING, count INT64>>
       |)
       |PARTITION BY ds
       |CLUSTER BY id
     """.stripMargin)
}
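
For completeness, here is a rough sketch of how final_df could then be loaded into that table with the spark-bigquery-connector. The rename to keywords, the ds column, and the staging bucket name are assumptions made here just to line the DataFrame up with the DDL above:

import org.apache.spark.sql.functions.current_date

// Match the table definition above: rename the struct array column to "keywords"
// and add the "ds" partition column
val toWrite = final_df
  .withColumnRenamed("app_language", "keywords")
  .withColumn("ds", current_date())

// outputTable is the same "dataset.table" string passed to createTableIfNotExists;
// "my-staging-bucket" is a placeholder GCS bucket for the connector's indirect write
toWrite.write
  .format("bigquery")
  .option("table", outputTable)
  .option("temporaryGcsBucket", "my-staging-bucket")
  .mode("append")
  .save()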
