I need to perform a triple join using three dataframes in Spark.
First, I obtain the main dataframe 'nodd' by joining 'popularity' with the previously loaded dataframes 'dbs_files' and 'dbs_blocks' on the columns shown below.
from pyspark.sql.functions import col

jobreports = spark.read.json(inputfile)
popularity = spark.read.json(hdir)
nodd = (popularity
    .filter(col('data.site_name') == "T1_ES_PIC")
    .join(dbs_files, col('data.file_lfn') == col('f_logical_file_name'))
    .join(dbs_blocks, col('f_block_id') == col('b_block_id'))
    .select('data.file_lfn', 'f_logical_file_name', 'f_creation_date', 'b_block_id', 'b_block_name'))
nodd.show(20)
Output:
+--------------------+--------------------+---------------+----------+--------------------+
| file_lfn| f_logical_file_name|f_creation_date|b_block_id| b_block_name|
+--------------------+--------------------+---------------+----------+--------------------+
|/store/mc/RunIISu...|/store/mc/RunIISu...| null| 23329663|/VBFHHTo2G2Qlnu_C...|
|/store/mc/RunIISu...|/store/mc/RunIISu...| null| 23329663|/VBFHHTo2G2Qlnu_C...|
...
Finally, I perform the last join against the 'jobreports' dataframe on the block name:
final_join = nodd.join(jobreports, col('b_block_name') == col('CRAB_DataBlock'))
This produces the following error message:
Py4JJavaError: An error occurred while calling o109.join.
: org.apache.spark.sql.AnalysisException: cannot resolve '`CRAB_DataBlock`' given input columns: [f_creation_date, metadata, f_logical_file_name, data, b_block_id, file_lfn, b_block_name];;
...
AnalysisException: "cannot resolve '`CRAB_DataBlock`' given input columns: [f_creation_date, metadata, f_logical_file_name, data, b_block_id, file_lfn, b_block_name];;
I don't understand the error, because this is exactly the same kind of join as the previous ones, on columns with the same name format and the same type (both are 'string' columns).
Is there something wrong with the way I perform the third join, or is there another way to perform this triple join?
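If the column CRAB_DataBlock comes from the dataframe jobreports, reference each column through the dataframe it belongs to. This is good practice because it makes the origin of every column explicit:
final_join = nodd.join(jobreports, nodd.b_block_name == jobreports.CRAB_DataBlock, "left")
Note that "left" makes this a left outer join, so rows of nodd without a matching block in jobreports are kept.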
If you still get an error after doing this, the specified column is not available as a top-level column of the dataframe jobreports, and you need to debug why, for example by inspecting it with jobreports.printSchema().
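One way to do that debugging is to list every nested field path in the dataframe's schema. The error message lists 'data' and 'metadata' among the input columns, and neither is selected into nodd, so they presumably come from jobreports; that would suggest CRAB_DataBlock is nested inside 'data' rather than being a top-level column. This is an inference from the error text, not something the question confirms. The sketch below is a plain-Python helper that walks the dict returned by DataFrame.schema.jsonValue(); the names field_paths, find_column and example_schema are hypothetical, and the example schema is made up for illustration.

```python
def field_paths(schema_json, prefix=""):
    """Return dotted paths for every field in a Spark schema given as a dict.

    schema_json is expected to have the shape produced by
    DataFrame.schema.jsonValue().
    """
    paths = []
    for field in schema_json.get("fields", []):
        path = prefix + field["name"]
        paths.append(path)
        ftype = field["type"]
        # Nested structs carry their own "fields" list, so recurse into them.
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            paths.extend(field_paths(ftype, prefix=path + "."))
    return paths


def find_column(schema_json, name):
    """Return every dotted path whose last component equals name."""
    return [p for p in field_paths(schema_json) if p.split(".")[-1] == name]


# Hypothetical schema for illustration only; with a live session it would
# come from jobreports.schema.jsonValue().
example_schema = {
    "type": "struct",
    "fields": [
        {"name": "data", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "CRAB_DataBlock", "type": "string",
                 "nullable": True, "metadata": {}},
            ],
        }},
        {"name": "metadata", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "timestamp", "type": "long",
                 "nullable": True, "metadata": {}},
            ],
        }},
    ],
}

print(find_column(example_schema, "CRAB_DataBlock"))
```

If a call such as find_column(jobreports.schema.jsonValue(), "CRAB_DataBlock") reports a nested path like data.CRAB_DataBlock, the join condition would need that full path, for instance nodd.b_block_name == jobreports['data.CRAB_DataBlock']. If it returns an empty list, the field is not present in the loaded JSON at all.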