
How to perform a triple join with pyspark using dataframes?

I need to perform a triple join using three dataframes in Spark.

First, I obtain the main dataframe 'nodd' by performing a double join on the following columns, with the previously loaded dataframes 'dbs_files' and 'dbs_blocks'.

from pyspark.sql.functions import col

jobreports = spark.read.json(inputfile)
popularity = spark.read.json(hdir)

nodd= (popularity
       .filter(col('data.site_name')=="T1_ES_PIC")
       .join(dbs_files, col('data.file_lfn')==col('f_logical_file_name'))
       .join(dbs_blocks, col('f_block_id')==col('b_block_id'))
       .select('data.file_lfn', 'f_logical_file_name', 'f_creation_date', 'b_block_id', 'b_block_name'))

nodd.show(20)

Output:

+--------------------+--------------------+---------------+----------+--------------------+
|            file_lfn| f_logical_file_name|f_creation_date|b_block_id|        b_block_name|
+--------------------+--------------------+---------------+----------+--------------------+
|/store/mc/RunIISu...|/store/mc/RunIISu...|           null|  23329663|/VBFHHTo2G2Qlnu_C...|
|/store/mc/RunIISu...|/store/mc/RunIISu...|           null|  23329663|/VBFHHTo2G2Qlnu_C...|
...

Finally, I perform the last join against the 'jobreports' dataframe on the specified column:

final_join=nodd.join(jobreports, col('b_block_name')==col('CRAB_DataBlock'))

which produces the following error message:

Py4JJavaError: An error occurred while calling o109.join.
: org.apache.spark.sql.AnalysisException: cannot resolve '`CRAB_DataBlock`' given input columns: [f_creation_date, metadata, f_logical_file_name, data, b_block_id, file_lfn, b_block_name];;
...

I don't understand the error, because it is exactly the same kind of join as the previous two, using columns with exactly the same format (both are 'string' type columns).

Is there some issue with the remaining third join, or is there another way to perform this triple join?

If the column CRAB_DataBlock comes from the dataframe jobreports, use the form below; qualifying each column with its dataframe is good coding practice:

final_join=nodd.join(jobreports, nodd.b_block_name == jobreports.CRAB_DataBlock, "left" )
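Referencing the columns through their dataframes (nodd.b_block_name, jobreports.CRAB_DataBlock) tells Spark which side of the join each column belongs to, which matters once several joins are chained and bare column names become ambiguous. Note that "left" keeps every row of nodd even when no matching block exists in jobreports; drop the third argument if you want the default inner join.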

After doing this, if you still get the above error, it means the specified column is not available in the jobreports dataframe, and you need to dig further into why.
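
A minimal debugging sketch, assuming the dataframes from the question are already loaded; the nested path 'data.CRAB_DataBlock' is only a hypothetical example of where the field might live:

# Inspect the schema to see where (and under which struct) the column actually lives
jobreports.printSchema()     # full, possibly nested, schema
print(jobreports.columns)    # top-level column names only

# If the field turns out to be nested inside a struct (hypothetical path shown),
# reference it through the struct path and qualify both sides of the join
final_join = nodd.join(jobreports,
                       nodd.b_block_name == jobreports['data.CRAB_DataBlock'],
                       "left")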
