
How to perform UPSERT or MERGE operation in Apache Spark?

I am trying to update existing records and insert new ones into an old Dataframe with Apache Spark, using the unique column "ID".

To update the Dataframe, you can perform a "left_anti" join on the unique columns and then union the result with the Dataframe containing the new records:

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{col, lit}

// Keep the old rows whose key has no newer version (left_anti join),
// then append all of the new rows.
def refreshUnion(oldDS: Dataset[_], newDS: Dataset[_], usingColumns: Seq[String]): Dataset[_] = {
  val filteredNewDS = selectAndCastColumns(newDS, oldDS)
  oldDS.join(
    filteredNewDS,
    usingColumns,
    "left_anti")
    .select(oldDS.columns.map(columnName => col(columnName)): _*)
    .union(filteredNewDS.toDF)
}

// Align ds to the columns and types of refDS: columns missing from ds
// become typed nulls, existing columns are cast to the reference type.
def selectAndCastColumns(ds: Dataset[_], refDS: Dataset[_]): Dataset[_] = {
  val columns = ds.columns.toSet
  ds.select(refDS.columns.map { c =>
    if (!columns.contains(c)) {
      lit(null).cast(refDS.schema(c).dataType) as c
    } else {
      ds(c).cast(refDS.schema(c).dataType) as c
    }
  }: _*)
}

val df = refreshUnion(oldDS, newDS, Seq("ID"))
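
For example, with hypothetical sample data (the "name" column and the values are invented for illustration), rows whose ID reappears in the new data are replaced, and brand-new IDs are appended:

import spark.implicits._ // assumes an active SparkSession named `spark`

// Hypothetical inputs for illustration only.
val oldDS = Seq((1, "alice"), (2, "bob")).toDF("ID", "name")
val newDS = Seq((2, "bobby"), (3, "carol")).toDF("ID", "name")

refreshUnion(oldDS, newDS, Seq("ID")).show()
// +---+-----+
// | ID| name|
// +---+-----+
// |  1|alice|   <- kept: ID 1 has no new version
// |  2|bobby|   <- replaced by the new record
// |  3|carol|   <- appended: new ID
// +---+-----+
// (row order may differ)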

Spark Dataframes are immutable structures, so you cannot perform any in-place update based on the ID.

The way to update a dataframe is to merge the older dataframe with the newer dataframe and save the merged dataframe to HDFS. To update the rows for old IDs you need some deduplication key (probably a timestamp).

I have added sample code for this in Scala. You need to call the merge function with the uniqueId and timestamp column names; the timestamp should be a Long.

import org.apache.spark.sql.{DataFrame, Dataset}

case class DedupableDF(unique_id: String, ts: Long)

// Union the snapshot with the delta, then keep only the latest
// version of every unique_id.
def merge(snapshot: DataFrame)(
    delta: DataFrame)(uniqueId: String, timeStampStr: String): DataFrame = {
  val mergedDf = snapshot.union(delta)
  dedupeData(mergedDf)(uniqueId, timeStampStr)
}

def dedupeData(dataFrameToDedupe: DataFrame)(
    uniqueId: String,
    timeStampStr: String): DataFrame = {
  import sqlContext.implicits._

  // For each unique_id, keep the (unique_id, ts) pair with the
  // highest timestamp.
  def removeDuplicates(
      duplicatedDataFrame: DataFrame): Dataset[DedupableDF] = {
    val dedupableDF = duplicatedDataFrame.map(a =>
      DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
    val mappedPairRdd =
      dedupableDF.map(row => (row.unique_id, (row.unique_id, row.ts))).rdd
    val reduceByKeyRDD = mappedPairRdd
      .reduceByKey((row1, row2) => if (row1._2 > row2._2) row1 else row2)
      .values
    reduceByKeyRDD.toDF.map(a =>
      DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
  }

  // Get distinct (unique_id, timestamp) combinations.
  val filteredData =
    dataFrameToDedupe.select(uniqueId, timeStampStr).distinct

  val dedupedData = removeDuplicates(filteredData)

  dataFrameToDedupe.createOrReplaceTempView("duplicatedDataFrame")
  dedupedData.createOrReplaceTempView("dedupedDataFrame")

  // Keep only the full rows whose (unique_id, ts) survived deduplication.
  sqlContext.sql(s"""select distinct duplicatedDataFrame.*
              from duplicatedDataFrame
              join dedupedDataFrame on
              (duplicatedDataFrame.${uniqueId} = dedupedDataFrame.unique_id
              and duplicatedDataFrame.${timeStampStr} = dedupedDataFrame.ts)""")
}
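
A minimal usage sketch, assuming a sqlContext is in scope as in the code above; the column names ("id", "event_ts"), the sample values, and the output path are hypothetical:

import sqlContext.implicits._

// Hypothetical snapshot and delta; "id" is the unique key and
// "event_ts" is a Long timestamp (e.g. epoch millis).
val snapshot = Seq(("a", 100L, "v1"), ("b", 100L, "v1"))
  .toDF("id", "event_ts", "value")
val delta = Seq(("a", 200L, "v2"), ("c", 200L, "v1"))
  .toDF("id", "event_ts", "value")

// Keeps ("a", 200, "v2"), ("b", 100, "v1"), ("c", 200, "v1"):
// the newer row for "a" wins, "b" is untouched, "c" is appended.
val merged = merge(snapshot)(delta)("id", "event_ts")

// Save the merged snapshot back to HDFS; the path is an assumption.
merged.write.mode("overwrite").parquet("/data/merged_snapshot")

Note that the timestamp acts as the deduplication key here: if two rows share the same unique_id and the same maximum timestamp but carry different payloads, both survive the final join, so the (unique_id, timestamp) pair should identify exactly one latest version.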
