
Multiple consecutive joins with PySpark

I'm trying to join multiple DataFrames together. Because of how join works, I end up with the same column name duplicated all over.

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

# Join Min and Max to S1
joinned_s1 = (minTime.join(maxTime, minTime["UserId"] == maxTime["UserId"]))

# Join S1 and sum to s2
joinned_s2 = (joinned_s1.join(sumTime, joinned_s1["UserId"] == sumTime["UserId"]))

I got this error: "Reference 'UserId' is ambiguous, could be: UserId#1578, UserId#3014.;"

What is the proper way of removing W from my dataset once it is successfully joined?

You can use an equi-join:

 minTime.join(maxTime, ["UserId"]).join(sumTime, ["UserId"])
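A minimal sketch with hypothetical one-row-per-user DataFrames shows that joining on a list of column names keeps a single UserId column in the result:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical pre-aggregated inputs, one row per user
minTime = spark.createDataFrame([("u1", 10), ("u2", 5)], ["UserId", "min"])
maxTime = spark.createDataFrame([("u1", 30), ("u2", 5)], ["UserId", "max"])
sumTime = spark.createDataFrame([("u1", 40), ("u2", 5)], ["UserId", "sum"])

joined = minTime.join(maxTime, ["UserId"]).join(sumTime, ["UserId"])
print(joined.columns)  # ['UserId', 'min', 'max', 'sum'] -- no duplicate key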

Or use aliases:

from pyspark.sql.functions import col

minTime.alias("minTime").join(
    maxTime.alias("maxTime"),
    col("minTime.UserId") == col("maxTime.UserId")
)
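Note that with aliases the result still contains both UserId columns; you can resolve the ambiguity by alias prefix and drop the duplicate key. A sketch, reusing the aliases above:

joined = minTime.alias("minTime").join(
    maxTime.alias("maxTime"),
    col("minTime.UserId") == col("maxTime.UserId")
)
# Drop the duplicate key coming from the maxTime side
deduped = joined.drop(col("maxTime.UserId"))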

Or reference the parent table:

(minTime
  .join(maxTime, minTime["UserId"] == maxTime["UserId"])
  .join(sumTime, minTime["UserId"] == sumTime["UserId"]))
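An expression join like this keeps both key columns, so to answer the removal question directly: drop the duplicates by referencing the DataFrame they came from. A sketch using the names above:

(minTime
  .join(maxTime, minTime["UserId"] == maxTime["UserId"])
  .join(sumTime, minTime["UserId"] == sumTime["UserId"])
  .drop(maxTime["UserId"])
  .drop(sumTime["UserId"]))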

On a side note, you're quoting the RDD docs, not the DataFrame ones. These are different data structures and don't operate in the same way.

Also, it looks like you're doing something strange here. Assuming you have a single parent table, min, max and sum can be computed as simple aggregations without any join.
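For example, assuming a single parent table events with UserId and Time columns (hypothetical names), one groupBy replaces all three intermediate DataFrames:

import pyspark.sql.functions as F

stats = events.groupBy("UserId").agg(
    F.min("Time").alias("minTime"),
    F.max("Time").alias("maxTime"),
    F.sum("Time").alias("sumTime"),
)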

If you join two DataFrames on column expressions, the join columns will be duplicated in the result. So use a string or a list of column names to join two or more DataFrames instead.

For example, if joining on columns:

df = left.join(right, left.name == right.name)

The output will contain two columns named "name".

Now if you use:

df = left.join(right, "name")
# or
df = left.join(right, ["name"])

Then the output will not have duplicate "name" columns.
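A quick way to see the difference, as a sketch with hypothetical one-row DataFrames:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([("a", 1)], ["name", "x"])
right = spark.createDataFrame([("a", 2)], ["name", "y"])

print(left.join(right, left.name == right.name).columns)
# ['name', 'x', 'name', 'y'] -- "name" appears twice
print(left.join(right, "name").columns)
# ['name', 'x', 'y'] -- single "name" column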
