
How to get a value from one PySpark dataframe using a where clause

I am very new to PySpark. I have a requirement in which I have to get one column, say 'id', from one MySQL table, and for each id I need to get the 'host' value, which is a column in another MySQL table. I have completed the first part and am getting the id using the piece of code below.

import pyspark.sql.functions as F

# Load both MySQL tables as dataframes
criteria_df = read_data_from_table(criteria_tbl)
datasource_df = read_data_from_table(data_source_tbl)

# For each account id in the criteria table, look up its host
for row in criteria_df.collect():
    account_id = row["account_id"]
    criteria_name = row["criteria"]
    datasource_df = datasource_df.select(F.col('host')).where(F.col('id') == account_id)
    datasource_df.show()

But when I try to get the host value for each id, I don't get any value.

You should put the where clause before the select clause; otherwise it will always return nothing, because the column referenced in the where clause no longer exists after the select.

datasource_df = datasource_df.where(F.col('id') == account_id).select(F.col('host'))
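As a minimal, self-contained sketch of the corrected loop (assuming a local SparkSession, with small in-memory dataframes standing in for the MySQL-backed tables and hypothetical sample values):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Stand-ins for the MySQL-backed tables (hypothetical sample data)
criteria_df = spark.createDataFrame(
    [(1, "crit_a"), (2, "crit_b")], ["account_id", "criteria"]
)
datasource_df = spark.createDataFrame(
    [(1, "host-1.example.com"), (2, "host-2.example.com")], ["id", "host"]
)

for row in criteria_df.collect():
    account_id = row["account_id"]
    # Filter first, then project: 'id' still exists when the filter runs.
    # Also assign to a new variable so the original dataframe is not
    # overwritten on each loop iteration.
    host_df = datasource_df.where(F.col('id') == account_id).select('host')
    host_df.show()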

Also, for this type of query it's better to do a join instead of collecting a dataframe and comparing it row by row on the driver.

You can use a left semi join:

datasource_df.join(criteria_df, on=datasource_df['id'] == criteria_df['account_id'], how='left_semi')\
.select(F.col('host'))
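For example, continuing the sketch above (same hypothetical sample dataframes), the semi join returns every host whose id appears in criteria_df in a single distributed query, with no collect() on the driver:

# Keep only datasource_df rows whose id matches an account_id in
# criteria_df; a left semi join keeps only left-side columns.
hosts_df = datasource_df.join(
    criteria_df,
    on=datasource_df['id'] == criteria_df['account_id'],
    how='left_semi',
).select(F.col('host'))

hosts_df.show()  # shows the host values for ids 1 and 2 from the sample data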
