
Multiple consecutive joins with PySpark

I'm trying to join multiple DataFrames together. Because of how join works, I end up with the same column name duplicated all over.

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

# Join Min and Max to S1
joinned_s1 = (minTime.join(maxTime, minTime["UserId"] == maxTime["UserId"]))

# Join S1 and sum to s2
joinned_s2 = (joinned_s1.join(sumTime, joinned_s1["UserId"] == sumTime["UserId"]))

I got this error: "Reference 'UserId' is ambiguous, could be: UserId#1578, UserId#3014.;"

What is the proper way of removing W from my dataset once it is successfully joined?

You can use an equi-join:

 minTime.join(maxTime, ["UserId"]).join(sumTime, ["UserId"])
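A minimal sketch with hypothetical one-row-per-user DataFrames shows that joining on a list of column names keeps a single UserId column in the result:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical pre-aggregated inputs, one row per user
minTime = spark.createDataFrame([("u1", 10), ("u2", 5)], ["UserId", "min"])
maxTime = spark.createDataFrame([("u1", 30), ("u2", 5)], ["UserId", "max"])
sumTime = spark.createDataFrame([("u1", 40), ("u2", 5)], ["UserId", "sum"])

joined = minTime.join(maxTime, ["UserId"]).join(sumTime, ["UserId"])
print(joined.columns)  # ['UserId', 'min', 'max', 'sum'] -- no duplicate key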

Or use aliases:

from pyspark.sql.functions import col

minTime.alias("minTime").join(
    maxTime.alias("maxTime"),
    col("minTime.UserId") == col("maxTime.UserId")
)
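Note that with aliases the result still contains both UserId columns; you can resolve the ambiguity by alias prefix and drop the duplicate key. A sketch, reusing the aliases above:

joined = minTime.alias("minTime").join(
    maxTime.alias("maxTime"),
    col("minTime.UserId") == col("maxTime.UserId")
)
# Drop the duplicate key coming from the maxTime side
deduped = joined.drop(col("maxTime.UserId"))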

Or reference the parent table:

(minTime
  .join(maxTime, minTime["UserId"] == maxTime["UserId"])
  .join(sumTime, minTime["UserId"] == sumTime["UserId"]))
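An expression join like this keeps both key columns, so to answer the removal question directly: drop the duplicates by referencing the DataFrame they came from. A sketch using the names above:

(minTime
  .join(maxTime, minTime["UserId"] == maxTime["UserId"])
  .join(sumTime, minTime["UserId"] == sumTime["UserId"])
  .drop(maxTime["UserId"])
  .drop(sumTime["UserId"]))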

On a side note, you're quoting the RDD docs, not the DataFrame ones. These are different data structures and don't operate in the same way.

Also, it looks like you're doing something strange here. Assuming you have a single parent table, min, max and sum can be computed as simple aggregations without any join.
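For example, assuming a single parent table events with UserId and Time columns (hypothetical names), one groupBy replaces all three intermediate DataFrames:

import pyspark.sql.functions as F

stats = events.groupBy("UserId").agg(
    F.min("Time").alias("minTime"),
    F.max("Time").alias("maxTime"),
    F.sum("Time").alias("sumTime"),
)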

If you join two DataFrames on column expressions, the join columns will be duplicated in the result. So use a string or a list of column names to join two or more DataFrames instead.

For example, if joining on columns:

df = left.join(right, left.name == right.name)

The output will contain two columns named "name".

Now if you use:

df = left.join(right, "name")
# or
df = left.join(right, ["name"])

Then the output will not have duplicate "name" columns.
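A quick way to see the difference, as a sketch with hypothetical one-row DataFrames:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([("a", 1)], ["name", "x"])
right = spark.createDataFrame([("a", 2)], ["name", "y"])

print(left.join(right, left.name == right.name).columns)
# ['name', 'x', 'name', 'y'] -- "name" appears twice
print(left.join(right, "name").columns)
# ['name', 'x', 'y'] -- single "name" column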
