How to join two dataframes?

I cannot get Spark's DataFrame join to work (no result is produced). Here is my code:

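// Needs case class Edge(start: Int, end: Int) at top level and the
// Spark SQL implicits in scope so that toDF() is available.
val e = Seq((1, 2), (1, 3), (2, 4))
val edges = e.map(p => Edge(p._1, p._2)).toDF()
// filtered is derived from edges, so both share the same column definitions.
val filtered = edges.filter("start = 1").distinct()
println("filtered")
filtered.show()
filtered.printSchema()
println("edges")
edges.show()
edges.printSchema()
// Join each filtered edge to the edges that start where it ends.
val joined = filtered.join(edges, filtered("end") === edges("start"))//.select(filtered("start"), edges("end"))
println("joined")
joined.show()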

It requires case class Edge(start: Int, end: Int) to be defined at the top level. Here is the output it produces:

filtered
+-----+---+
|start|end|
+-----+---+
|    1|  2|
|    1|  3|
+-----+---+

root
 |-- start: integer (nullable = false)
 |-- end: integer (nullable = false)

edges
+-----+---+
|start|end|
+-----+---+
|    1|  2|
|    1|  3|
|    2|  4|
+-----+---+

root
 |-- start: integer (nullable = false)
 |-- end: integer (nullable = false)

joined
+-----+---+-----+---+
|start|end|start|end|
+-----+---+-----+---+
+-----+---+-----+---+

I don't understand why the output is empty. Why isn't the first row of filtered combined with the last row of edges?

// Rename the columns of filtered so they no longer share names with those of edges.
val f2 = filtered.withColumnRenamed("start", "fStart").withColumnRenamed("end", "fEnd")
f2.join(edges, f2("fEnd") === edges("start")).show

I believe this is because filtered("start").equals(edges("start")): filtered is a filtered view on edges, so the two share the same column definitions. Because the columns are the same, Spark cannot tell which one you are referencing.

As such, you can do things like

edges.select(filtered("start")).show
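
Another way to disambiguate the two sides, instead of renaming columns, is to give each DataFrame an alias and refer to the columns through it. The following is only a sketch under some assumptions: it presumes Spark 2.x or later with a local SparkSession, and the names AliasJoinSketch, joinWithAliases, "f" and "e" are made up for illustration and do not appear in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Defined at top level, as the question requires.
case class Edge(start: Int, end: Int)

object AliasJoinSketch {
  // Hypothetical helper: builds the same data as the question and joins
  // filtered to edges through aliases rather than renamed columns.
  def joinWithAliases(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val edges = Seq((1, 2), (1, 3), (2, 4)).map(p => Edge(p._1, p._2)).toDF()
    val filtered = edges.filter("start = 1").distinct()

    // The aliases "f" and "e" make it explicit which side each column
    // belongs to, even though both DataFrames derive from the same source.
    filtered.as("f")
      .join(edges.as("e"), col("f.end") === col("e.start"))
      .select(col("f.start"), col("e.end"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: running locally; adjust master and appName as needed.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("alias-join-sketch")
      .getOrCreate()

    joinWithAliases(spark).show()
    spark.stop()
  }
}
```

With the data from the question, the row (1, 2) of filtered should pair with the row (2, 4) of edges, which is the combination the question expected to see.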
