Converting a Spark DataFrame schema to a new schema

Question

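I have several Spark jobs that read from different sources. Their schemas differ but are very close, and I want to write all of them to the same Redshift table, so I need to unify all of the DataFrame schemas. What is the best way to do that?

Say the schema of the first input source is: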

  val schema1 = StructType(Seq(
    StructField("date", DateType),
    StructField("campaign_id", StringType),
    StructField("campaign_name", StringType),
    StructField("platform", StringType),
    StructField("country", StringType),
    StructField("views", DoubleType),
    StructField("installs", DoubleType),
    StructField("spend", DoubleType)
  ))

The schema of the second input source is:

  val schema2 = StructType(Seq(
    StructField("date", DateType),
    StructField("creator_id", StringType),
    StructField("creator_name", StringType),
    StructField("platform", StringType),
    StructField("views", DoubleType),
    StructField("installs", DoubleType),
    StructField("spend", DoubleType),
    StructField("ecpm", DoubleType)
  ))

The table schema (the expected unified DataFrame):

  val finalSchema = StructType(Seq(
    StructField("date", DateType),
    StructField("account_name", StringType),
    StructField("adset_id", StringType),
    StructField("adset_name", StringType),
    StructField("campaign_id", StringType),
    StructField("campaign_name", StringType),
    StructField("pub_id", StringType),
    StructField("pub_name", StringType),
    StructField("creative_id", StringType),
    StructField("creative_name", StringType),
    StructField("platform", StringType),
    StructField("install_source", StringType),
    StructField("views", IntegerType),
    StructField("clicks", IntegerType),
    StructField("installs", IntegerType),
    StructField("cost", DoubleType)
  ))

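As you can see in the final schema, some columns may not exist in an input schema, so they should be null, and some columns need to be renamed. Other columns, such as ecpm, should be dropped.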

Answer 1

Add index columns to both DataFrames and join them on the index, which gives a one-to-one mapping. Then select only the required columns from the joined DataFrame.

Suppose you have two DataFrames like these:

// df1.show
+-----+---+
| name|age|
+-----+---+
|Alice| 25|
|  Bob| 29|
|  Tom| 26|
+-----+---+

// df2.show
+--------+-------+
|    city|country|
+--------+-------+
|   Delhi|  India|
|New York|    USA|
|  London|     UK|
+--------+-------+

Now add the index columns and join to get the one-to-one mapping:

import org.apache.spark.sql.functions._

// monotonically_increasing_id() replaces the deprecated monotonicallyIncreasingId.
// The ids are unique but not consecutive across partitions, so this join only
// lines rows up when both DataFrames are partitioned identically.
val df1Index = df1.withColumn("index1", monotonically_increasing_id())
val df2Index = df2.withColumn("index2", monotonically_increasing_id())
val joinedDf = df1Index.join(df2Index, df1Index("index1") === df2Index("index2"))

// joinedDf.show
+-----+---+------+--------+-------+------+
| name|age|index1|    city|country|index2|
+-----+---+------+--------+-------+------+
|Alice| 25|     0|   Delhi|  India|     0|
|  Bob| 29|     1|New York|    USA|     1|
|  Tom| 26|     2|  London|     UK|     2|
+-----+---+------+--------+-------+------+

Now you can select the columns you need:

val queryList=List(col("name"),col("age"),col("country"))
joinedDf.select(queryList:_*).show

//Output df
+-----+---+-------+
| name|age|country|
+-----+---+-------+
|Alice| 25|  India|
|  Bob| 29|    USA|
|  Tom| 26|     UK|
+-----+---+-------+

Answer 2

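I'm not sure there is a fully automatic way to do this. If your schemas are fixed and not especially complex, you can adjust the schemas by hand and union the results.

For the sake of discussion, say you want col1 and col2 from frame1, and col2 and col4 from frame2.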

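import org.apache.spark.sql.functions._
import spark.implicits._  // enables the $"col" syntax; assumes a SparkSession named spark

// Fill each column the other frame lacks with null so both sides line up.
val subset1 = frame1.select($"col1", $"col2", lit(null).as("col4"))
val subset2 = frame2.select(lit(null).as("col1"), $"col2", $"col4")
// union matches columns by position, so both selects must use the same order.
val result = subset1 union subset2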

This accomplishes the goal. Each column is specified by hand, so any column you don't want can simply be left out.
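The manual select-and-union approach can be generalized so that the select list is derived from the target schema instead of being typed out per source. The following is only a sketch under stated assumptions: `conformTo` and its `renames` parameter are hypothetical names introduced here, not part of Spark or of the original answer. It relies on `col`, `lit`, `Column.cast`, and `Column.as`, which are standard Spark SQL calls.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

// Hypothetical helper (an illustration, not from the original answer):
// projects `df` onto the columns of `target`, in the target's order.
// `renames` maps a target column name to the source column that feeds it.
def conformTo(
    df: DataFrame,
    target: StructType,
    renames: Map[String, String] = Map.empty
): DataFrame = {
  val available = df.columns.toSet
  val projected: Seq[Column] = target.fields.toSeq.map { field =>
    val source = renames.getOrElse(field.name, field.name)
    // Columns missing from the source become null.
    val value = if (available.contains(source)) col(source) else lit(null)
    // Cast to the target type so every conformed frame has the same schema.
    value.cast(field.dataType).as(field.name)
  }
  // Source columns not named by the target (such as ecpm) are dropped here.
  df.select(projected: _*)
}
```

With the schemas from the question, a call might look like `conformTo(df1, finalSchema, Map("cost" -> "spend"))`, and the conformed frames can then be combined with `union`. Which source column feeds which target column (for example, whether `creator_id` belongs in `creative_id`) is a decision for the asker and is not stated in the question. Note also that casting the `DoubleType` counts to `IntegerType` truncates any fractional part.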
