
Multiple consecutive joins with PySpark

I'm trying to join multiple DataFrames together. Because of how join works, I end up with the same column name duplicated across the result.

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

# Join Min and Max to S1
joinned_s1 = (minTime.join(maxTime, minTime["UserId"] == maxTime["UserId"]))

# Join S1 and sum to s2
joinned_s2 = (joinned_s1.join(sumTime, joinned_s1["UserId"] == sumTime["UserId"]))

I got this error: "Reference 'UserId' is ambiguous, could be: UserId#1578, UserId#3014.;"

What is the proper way of removing W from my dataset once successfully joined?

You can use an equi-join:

 minTime.join(maxTime, ["UserId"]).join(sumTime, ["UserId"])

aliases:

from pyspark.sql.functions import col

minTime.alias("minTime").join(
    maxTime.alias("maxTime"),
    col("minTime.UserId") == col("maxTime.UserId")
)

or reference the parent table:

(minTime
  .join(maxTime, minTime["UserId"] == maxTime["UserId"])
  .join(sumTime, minTime["UserId"] == sumTime["UserId"]))

On a side note, you're quoting the RDD docs, not the DataFrame ones. These are different data structures and don't operate in the same way.

Also, it looks like you're doing something strange here. Assuming you have a single parent table, min, max and sum can be computed as simple aggregations without any join.

If you join two DataFrames on column expressions, the join columns will be duplicated in the result. So use a string or a list of column names when joining two or more DataFrames.

For example, when joining on a column expression:

df = left.join(right, left.name == right.name)

the output will contain two columns named "name".

Now if you use:

df = left.join(right, "name")

or

df = left.join(right, ["name"])

then the output will not have duplicate columns.
