How to get a value from one pyspark dataframe using where clause

I am very new to Pyspark. I have a requirement where I have to get one column, say 'id', from one MySQL table, and for each id fetch the 'HOST' value, which is a column in another MySQL table. I have completed the first part and am getting the ids with the code below.

criteria_df = read_data_from_table(criteria_tbl)
datasource_df = read_data_from_table(data_source_tbl)
import pyspark.sql.functions as F

for row in criteria_df.collect(): 
  account_id = row["account_id"]
  criteria_name = row["criteria"]
  datasource_df = datasource_df.select(F.col('host')).where(F.col('id') == account_id)
  datasource_df.show()

But when I try to get the host value for each id, I don't get any rows back.

You should put the where clause before the select clause; otherwise the query always returns nothing, because once you select only host, the id column referenced in the where clause no longer exists.

datasource_df = datasource_df.where(F.col('id') == account_id).select(F.col('host'))

Also, for this type of query it's better to do a join instead of collecting one dataframe and filtering the other row by row.

You can use a semi-join:

datasource_df.join(criteria_df, on=datasource_df['id'] == criteria_df['account_id'], how='left_semi')\
.select(F.col('host'))

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you repost, please cite this site or the original source. For any questions contact: yoyou2525@163.com.
