
How to perform UPSERT or MERGE operation in Apache Spark?

I am trying to update existing records and insert new ones into an old Dataframe with Apache Spark, using the unique column "ID".

To update the Dataframe, you can perform a "left_anti" join on the unique columns and then union the result with the Dataframe containing the new records:

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{col, lit}

// Keep the old rows whose key has no newer version (left_anti join),
// then append all of the new rows.
def refreshUnion(oldDS: Dataset[_], newDS: Dataset[_], usingColumns: Seq[String]): Dataset[_] = {
  val filteredNewDS = selectAndCastColumns(newDS, oldDS)
  oldDS.join(
    filteredNewDS,
    usingColumns,
    "left_anti")
    .select(oldDS.columns.map(columnName => col(columnName)): _*)
    .union(filteredNewDS.toDF)
}

// Align ds to the columns and types of refDS: columns missing from ds
// become typed nulls, existing columns are cast to the reference type.
def selectAndCastColumns(ds: Dataset[_], refDS: Dataset[_]): Dataset[_] = {
  val columns = ds.columns.toSet
  ds.select(refDS.columns.map { c =>
    if (!columns.contains(c)) {
      lit(null).cast(refDS.schema(c).dataType) as c
    } else {
      ds(c).cast(refDS.schema(c).dataType) as c
    }
  }: _*)
}

val df = refreshUnion(oldDS, newDS, Seq("ID"))
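
For example, with hypothetical sample data (the "name" column and the values are invented for illustration), rows whose ID reappears in the new data are replaced, and brand-new IDs are appended:

import spark.implicits._ // assumes an active SparkSession named `spark`

// Hypothetical inputs for illustration only.
val oldDS = Seq((1, "alice"), (2, "bob")).toDF("ID", "name")
val newDS = Seq((2, "bobby"), (3, "carol")).toDF("ID", "name")

refreshUnion(oldDS, newDS, Seq("ID")).show()
// +---+-----+
// | ID| name|
// +---+-----+
// |  1|alice|   <- kept: ID 1 has no new version
// |  2|bobby|   <- replaced by the new record
// |  3|carol|   <- appended: new ID
// +---+-----+
// (row order may differ)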

Spark Dataframes are immutable structures, so you cannot perform any in-place update based on the ID.

The way to update a dataframe is to merge the older dataframe with the newer dataframe and save the merged dataframe to HDFS. To update the rows for old IDs you need some deduplication key (probably a timestamp).

I have added sample code for this in Scala. You need to call the merge function with the uniqueId and timestamp column names; the timestamp should be a Long.

import org.apache.spark.sql.{DataFrame, Dataset}

case class DedupableDF(unique_id: String, ts: Long)

// Union the snapshot with the delta, then keep only the latest
// version of every unique_id.
def merge(snapshot: DataFrame)(
    delta: DataFrame)(uniqueId: String, timeStampStr: String): DataFrame = {
  val mergedDf = snapshot.union(delta)
  dedupeData(mergedDf)(uniqueId, timeStampStr)
}

def dedupeData(dataFrameToDedupe: DataFrame)(
    uniqueId: String,
    timeStampStr: String): DataFrame = {
  import sqlContext.implicits._

  // For each unique_id, keep the (unique_id, ts) pair with the
  // highest timestamp.
  def removeDuplicates(
      duplicatedDataFrame: DataFrame): Dataset[DedupableDF] = {
    val dedupableDF = duplicatedDataFrame.map(a =>
      DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
    val mappedPairRdd =
      dedupableDF.map(row => (row.unique_id, (row.unique_id, row.ts))).rdd
    val reduceByKeyRDD = mappedPairRdd
      .reduceByKey((row1, row2) => if (row1._2 > row2._2) row1 else row2)
      .values
    reduceByKeyRDD.toDF.map(a =>
      DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
  }

  // Get distinct (unique_id, timestamp) combinations.
  val filteredData =
    dataFrameToDedupe.select(uniqueId, timeStampStr).distinct

  val dedupedData = removeDuplicates(filteredData)

  dataFrameToDedupe.createOrReplaceTempView("duplicatedDataFrame")
  dedupedData.createOrReplaceTempView("dedupedDataFrame")

  // Keep only the full rows whose (unique_id, ts) survived deduplication.
  sqlContext.sql(s"""select distinct duplicatedDataFrame.*
              from duplicatedDataFrame
              join dedupedDataFrame on
              (duplicatedDataFrame.${uniqueId} = dedupedDataFrame.unique_id
              and duplicatedDataFrame.${timeStampStr} = dedupedDataFrame.ts)""")
}
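
A minimal usage sketch, assuming a sqlContext is in scope as in the code above; the column names ("id", "event_ts"), the sample values, and the output path are hypothetical:

import sqlContext.implicits._

// Hypothetical snapshot and delta; "id" is the unique key and
// "event_ts" is a Long timestamp (e.g. epoch millis).
val snapshot = Seq(("a", 100L, "v1"), ("b", 100L, "v1"))
  .toDF("id", "event_ts", "value")
val delta = Seq(("a", 200L, "v2"), ("c", 200L, "v1"))
  .toDF("id", "event_ts", "value")

// Keeps ("a", 200, "v2"), ("b", 100, "v1"), ("c", 200, "v1"):
// the newer row for "a" wins, "b" is untouched, "c" is appended.
val merged = merge(snapshot)(delta)("id", "event_ts")

// Save the merged snapshot back to HDFS; the path is an assumption.
merged.write.mode("overwrite").parquet("/data/merged_snapshot")

Note that the timestamp acts as the deduplication key here: if two rows share the same unique_id and the same maximum timestamp but carry different payloads, both survive the final join, so the (unique_id, timestamp) pair should identify exactly one latest version.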
