I am very new to PySpark. I have a requirement where I need to get one column, 'id', from one MySQL table, and for each id, get the 'HOST' value, which is a column in another MySQL table. The first part is done; I am getting the id using the piece of code below.
criteria_df = read_data_from_table(criteria_tbl)
datasource_df = read_data_from_table(data_source_tbl)
import pyspark.sql.functions as F

for row in criteria_df.collect():
    account_id = row["account_id"]
    criteria_name = row["criteria"]
    datasource_df = datasource_df.select(F.col('host')).where(F.col('id') == account_id)
    datasource_df.show()
But when I try to get the host value for each id, I am not getting any value.
You should put the where clause before the select clause; otherwise it always returns nothing, because after selecting only 'host', the 'id' column used in the where clause no longer exists.
datasource_df = datasource_df.where(F.col('id') == account_id).select(F.col('host'))
Also, for this type of query it's better to do a join instead of collecting the dataframes and comparing them row by row. You can use a semi-join:
datasource_df.join(criteria_df, on=datasource_df['id'] == criteria_df['account_id'], how='left_semi')\
.select(F.col('host'))