
How to correctly join two DataFrames in Apache Spark?

I am new to Apache Spark and need some help. Can someone tell me how to correctly join the following two DataFrames?

First DataFrame:

| DATE_TIME           | PHONE_NUMBER |
|---------------------|--------------|
| 2019-01-01 00:00:00 | 7056589658   |
| 2019-02-02 00:00:00 | 7778965896   |

Second DataFrame:

| DATE_TIME           | IP            |
|---------------------|---------------|
| 2019-01-01 01:00:00 | 194.67.45.126 |
| 2019-02-02 00:00:00 | 102.85.62.100 |
| 2019-03-03 03:00:00 | 102.85.62.100 |

The final DataFrame I want:

| DATE_TIME           | PHONE_NUMBER | IP            |
|---------------------|--------------|---------------|
| 2019-01-01 00:00:00 | 7056589658   |               |
| 2019-01-01 01:00:00 |              | 194.67.45.126 |
| 2019-02-02 00:00:00 | 7778965896   | 102.85.62.100 |
| 2019-03-03 03:00:00 |              | 102.85.62.100 |

Here is the code I have tried:

import org.apache.spark.sql.Dataset
import spark.implicits._

val df1 = Seq(
    ("2019-01-01 00:00:00", "7056589658"),
    ("2019-02-02 00:00:00", "7778965896")
).toDF("DATE_TIME", "PHONE_NUMBER")

df1.show()

val df2 = Seq(
    ("2019-01-01 01:00:00", "194.67.45.126"),
    ("2019-02-02 00:00:00", "102.85.62.100"),
    ("2019-03-03 03:00:00", "102.85.62.100")
).toDF("DATE_TIME", "IP")

df2.show()

val total = df1.join(df2, Seq("DATE_TIME"), "left_outer")

total.show()

Unfortunately, it throws an error:

org.apache.spark.SparkException: Exception thrown in awaitResult:
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
  at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:367)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:144)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:140)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:140)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:135)
...

You need a full outer join, but otherwise your code is fine. Your problem is probably something else; the stack trace you posted is not enough to pinpoint the root cause.

val total = df1.join(df2, Seq("DATE_TIME"), "full_outer")
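
For reference, a minimal self-contained sketch (assuming a standalone local run; in spark-shell the SparkSession named spark already exists and the builder can be skipped) that reproduces the desired table with a full outer join:

import org.apache.spark.sql.SparkSession

// Assumption: running as a standalone local app; in spark-shell, `spark` is already in scope.
val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

// Same sample data as in the question.
val df1 = Seq(
    ("2019-01-01 00:00:00", "7056589658"),
    ("2019-02-02 00:00:00", "7778965896")
).toDF("DATE_TIME", "PHONE_NUMBER")

val df2 = Seq(
    ("2019-01-01 01:00:00", "194.67.45.126"),
    ("2019-02-02 00:00:00", "102.85.62.100"),
    ("2019-03-03 03:00:00", "102.85.62.100")
).toDF("DATE_TIME", "IP")

// A full outer join keeps rows from both sides; columns with no match become null.
df1.join(df2, Seq("DATE_TIME"), "full_outer")
    .orderBy("DATE_TIME")
    .show(false)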

You can do it like this:

val total = df1.join(df2, (df1("DATE_TIME") === df2("DATE_TIME")), "left_outer")
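
Note that this expression-based join keeps both DATE_TIME columns in the result (one from each side), and with "left_outer" it still drops the df2-only rows. A hedged sketch of one way to get back a single DATE_TIME column, combining a full outer join with coalesce:

import org.apache.spark.sql.functions.coalesce

// coalesce picks whichever side's DATE_TIME is non-null, so the duplicated
// join column collapses back into one.
val merged = df1
    .join(df2, df1("DATE_TIME") === df2("DATE_TIME"), "full_outer")
    .select(
        coalesce(df1("DATE_TIME"), df2("DATE_TIME")).as("DATE_TIME"),
        df1("PHONE_NUMBER"),
        df2("IP")
    )
    .orderBy("DATE_TIME")

merged.show(false)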
