I am very new to PySpark. I have a requirement where I need to get one column, 'id', from one MySQL table, and for each id, get the 'HOST' value, which is a column in another MySQL table. The first part is done; I am getting the id using the piece of code below.
criteria_df = read_data_from_table(criteria_tbl)
datasource_df = read_data_from_table(data_source_tbl)
import pyspark.sql.functions as F

for row in criteria_df.collect():
    account_id = row["account_id"]
    criteria_name = row["criteria"]
    datasource_df = datasource_df.select(F.col('host')).where(F.col('id') == account_id)
    datasource_df.show()
But when I try to get the host value for each id, I am not getting any value.
You should put the where clause before the select clause; otherwise it always returns nothing, because after selecting only 'host', the 'id' column used in the where clause no longer exists.
datasource_df = datasource_df.where(F.col('id') == account_id).select(F.col('host'))
Also, for this type of query it's better to do a join instead of collecting the dataframes and comparing them row by row. You can use a semi-join:
datasource_df.join(criteria_df, on=datasource_df['id'] == criteria_df['account_id'], how='left_semi')\
.select(F.col('host'))