How to join two dataframes?
I cannot get Spark's DataFrame join to work (no result gets produced). Here is my code:
val e = Seq((1, 2), (1, 3), (2, 4))
var edges = e.map(p => Edge(p._1, p._2)).toDF()
var filtered = edges.filter("start = 1").distinct()
println("filtered")
filtered.show()
filtered.printSchema()
println("edges")
edges.show()
edges.printSchema()
var joined = filtered.join(edges, filtered("end") === edges("start"))//.select(filtered("start"), edges("end"))
println("joined")
joined.show()
It requires case class Edge(start: Int, end: Int)
to be defined at top level. Here is the output it produces:
filtered
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 1| 3|
+-----+---+
root
|-- start: integer (nullable = false)
|-- end: integer (nullable = false)
edges
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 1| 3|
| 2| 4|
+-----+---+
root
|-- start: integer (nullable = false)
|-- end: integer (nullable = false)
joined
+-----+---+-----+---+
|start|end|start|end|
+-----+---+-----+---+
+-----+---+-----+---+
I don't understand why the output is empty. Why doesn't the first row of filtered
get combined with the last row of edges?
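For reference, the result this join is expected to produce can be sketched with plain Scala collections (no Spark involved), using the same edge data as above:

```scala
case class Edge(start: Int, end: Int)

val edges = Seq(Edge(1, 2), Edge(1, 3), Edge(2, 4))
val filtered = edges.filter(_.start == 1).distinct

// Equi-join on filtered.end == edges.start, projecting (filtered.start, edges.end)
val joined = for {
  f <- filtered
  e <- edges
  if f.end == e.start
} yield (f.start, e.end)

println(joined) // List((1,4)): row (1,2) combines with (2,4)
```

So the expected output of the Spark join above is a single row joining (1, 2) with (2, 4).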
A workaround is to rename the columns of filtered before the join:
val f2 = filtered.withColumnRenamed("start", "fStart").withColumnRenamed("end", "fEnd")
f2.join(edges, f2("fEnd") === edges("start")).show
I believe this is because filtered("start").equals(edges("start")):
since filtered
is a filtered view on edges, the two DataFrames share the same column definitions. The columns are identical, so Spark cannot tell which one you are referencing.
As a consequence, you can even do things like
edges.select(filtered("start")).show
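The same ambiguity can also be resolved with DataFrame aliases instead of withColumnRenamed. Here is a minimal sketch, assuming a local Spark session (the builder call and app name are illustrative; in spark-shell, `spark` and the implicits already exist):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session setup
val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

// Using toDF with explicit column names avoids the top-level case class requirement
val edges = Seq((1, 2), (1, 3), (2, 4)).toDF("start", "end")
val filtered = edges.filter("start = 1").distinct()

// Aliasing both sides gives each column an unambiguous qualifier
val joined = filtered.as("f")
  .join(edges.as("e"), $"f.end" === $"e.start")
  .select($"f.start", $"e.end")
joined.show()
```

Aliasing keeps the original column names, which is convenient when the result feeds into further joins.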