
Effectively counting records after a join in Spark

This is what I am doing. I need to get the number of records that are present in one dataset but not in the other, and then join the result with a third dataset to get some other columns.

val tooCompare = dw
        .select(
          "loc",
          "id",
          "country",
          "region"
        ).dropDuplicates()

val previous = dw
        .select(
          "loc",
          "id",
          "country",
          "region"
        ).dropDuplicates()

val delta = tooCompare.exceptAll(previous).cache()
 
val records = delta
        .join(
          dw,//another dataset
          delta
            .col("loc").equalTo(dw.col("loc"))
            .and(delta.col("id").equalTo(dw.col("id")))
            .and(delta.col("country").equalTo(dw.col("country")))
            .and(delta.col("region").equalTo(dw.col("region")))
        )
        .drop(delta.col("loc"))
        .drop(delta.col("id"))
        .drop(delta.col("country"))
        .drop(delta.col("region"))
        .cache()

 val recordsToSend = records.cache()
 val count = recordsToSend.select("loc").distinct().count()

Is there a more efficient way to do this? I am new to Spark, and I am pretty sure I am missing something here.

I would suggest using SQL to make this more readable.

First, create temp views of the DataFrames in question. I don't know exactly what DataFrames you have, so something like:

dfToCompare.createOrReplaceTempView("toCompare")
previousDf.createOrReplaceTempView("previous")
anotherDataSet.createOrReplaceTempView("another")

Then you can do all your operations in one SQL statement:

// Columns are qualified with c because loc, id, country and region exist in both c and a.
val records = spark.sql("""select c.loc, c.id, c.country, c.region
              from toCompare c
              inner join another a
               on a.loc = c.loc
                and a.id = c.id
                and a.country = c.country
                and a.region = c.region
             where not exists (select null
                                from previous p
                                where p.loc = c.loc
                                 and p.id = c.id
                                 and p.country = c.country
                                 and p.region = c.region)""")

Then you can proceed as before:

val recordsToSend = records.cache()
val count = recordsToSend.select("loc").distinct().count()
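
To try the SQL approach end to end, here is a minimal sketch on made-up data. The object name NotExistsSketch, the sample rows, the extra payload column and the local[1] master are all assumptions for illustration, not part of the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NotExistsSketch {
  // Registers three tiny, hypothetical tables as temp views and runs the
  // NOT EXISTS query against them.
  def newRecords(spark: SparkSession): DataFrame = {
    import spark.implicits._

    Seq(("L1", 1, "US", "NA"), ("L2", 2, "DE", "EU"), ("L3", 3, "FR", "EU"))
      .toDF("loc", "id", "country", "region")
      .createOrReplaceTempView("toCompare")
    Seq(("L1", 1, "US", "NA"))
      .toDF("loc", "id", "country", "region")
      .createOrReplaceTempView("previous")
    // "payload" stands in for the other columns you want from the third dataset.
    Seq(("L2", 2, "DE", "EU", "x"), ("L2", 2, "DE", "EU", "y"), ("L3", 3, "FR", "EU", "z"))
      .toDF("loc", "id", "country", "region", "payload")
      .createOrReplaceTempView("another")

    spark.sql(
      """select c.loc, c.id, c.country, c.region, a.payload
        |from toCompare c
        |inner join another a
        |  on a.loc = c.loc and a.id = c.id
        | and a.country = c.country and a.region = c.region
        |where not exists (select 1
        |                  from previous p
        |                  where p.loc = c.loc and p.id = c.id
        |                    and p.country = c.country and p.region = c.region)
        |""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder().master("local[1]").appName("not-exists-sketch").getOrCreate()
    val records = newRecords(spark).cache()
    println(records.select("loc").distinct().count())
    // Print the physical plan to see how Spark chose to execute the query.
    records.explain()
    spark.stop()
  }
}
```

Only the rows for L2 and L3 survive the NOT EXISTS filter here, because L1 is also in previous.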

I think there are potentially some errors in the code you've pasted: tooCompare and previous are the same, and the join with the third dataset references deAnon but uses dw as the table.

For this example answer, assume your current table is called "current", the previous one is called "previous" and the third table is called "extra". Then:

val delta = current.join(
              previous, 
              Seq("loc","id","country","region"), 
              "leftanti"
            ).select("loc","id","country","region").distinct

val recordsToSend = delta
                    .join(
                      extra,
                      Seq("loc", "id", "country", "region")
                    )

val count = recordsToSend.select("loc").distinct().count()

This may be more efficient, but I'd appreciate a comment on whether it actually was!
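
If you want to check the behaviour on a small scale first, the left anti join approach can be sketched as below. The object name LeftAntiSketch, the sample rows and the payload column are hypothetical. Calling explain() on the result is one way to compare its physical plan with that of the exceptAll version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LeftAntiSketch {
  val keys: Seq[String] = Seq("loc", "id", "country", "region")

  // Rows of `current` whose key is absent from `previous`, enriched with `extra`.
  def recordsToSend(current: DataFrame, previous: DataFrame, extra: DataFrame): DataFrame = {
    val delta = current
      .join(previous, keys, "leftanti")
      .select(keys.head, keys.tail: _*)
      .distinct()
    // Joining on a Seq of column names keeps a single copy of each key column.
    delta.join(extra, keys)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder().master("local[1]").appName("left-anti-sketch").getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val current = Seq(("L1", 1, "US", "NA"), ("L2", 2, "DE", "EU"), ("L3", 3, "FR", "EU"))
      .toDF("loc", "id", "country", "region")
    val previous = Seq(("L1", 1, "US", "NA")).toDF("loc", "id", "country", "region")
    val extra = Seq(("L2", 2, "DE", "EU", "x"), ("L3", 3, "FR", "EU", "z"))
      .toDF("loc", "id", "country", "region", "payload")

    val out = recordsToSend(current, previous, extra)
    out.explain()
    println(out.select("loc").distinct().count())
    spark.stop()
  }
}
```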

Just as an aside: note that I'm using a Seq[String] as the join argument (this requires the column names to be identical on both tables, and it won't produce two copies of the columns). However, your original join logic can be written a bit more succinctly, as follows (using my naming conventions):

val recordsToSend = delta
                    .join(
                      extra,
                      delta("loc") === extra("loc")
                        && delta("id") === extra("id")
                        && delta("country") === extra("country")
                        && delta("region") === extra("region")
                    )
                    .drop(delta("loc"))
                    .drop(delta("id"))
                    .drop(delta("country"))
                    .drop(delta("region"))

Even better would be to write a drop function that lets you provide more than one column, but I'm going really off topic now ;-)
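
For what it's worth, such a helper could look roughly like this. dropAll is a hypothetical name, not a Spark API, and the sample data is made up; it simply folds the existing single-column drop over a list of Column references.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

object DropSketch {
  val keys: Seq[String] = Seq("loc", "id", "country", "region")

  // Hypothetical helper: applies DataFrame.drop(Column) once per given column.
  def dropAll(df: DataFrame, cols: Column*): DataFrame =
    cols.foldLeft(df)((acc, c) => acc.drop(c))

  // Expression join on the key columns, then remove delta's copies of the keys.
  def joinKeepingExtraKeys(delta: DataFrame, extra: DataFrame): DataFrame = {
    val condition = keys.map(k => delta(k) === extra(k)).reduce(_ && _)
    dropAll(delta.join(extra, condition), keys.map(k => delta(k)): _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder().master("local[1]").appName("drop-sketch").getOrCreate()
    import spark.implicits._

    val delta = Seq(("L2", 2, "DE", "EU")).toDF("loc", "id", "country", "region")
    val extra = Seq(("L2", 2, "DE", "EU", "x")).toDF("loc", "id", "country", "region", "payload")

    joinKeepingExtraKeys(delta, extra).show()
    spark.stop()
  }
}
```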
