How to get a value from one pyspark dataframe using where clause

I am very new to Pyspark. I have a requirement where I have to get one column, say 'id', from one MySQL table, and for each id fetch the 'HOST' value, which is a column in another MySQL table. I have completed the first part and am getting the ids with the code below.

criteria_df = read_data_from_table(criteria_tbl)
datasource_df = read_data_from_table(data_source_tbl)
import pyspark.sql.functions as F

for row in criteria_df.collect(): 
  account_id = row["account_id"]
  criteria_name = row["criteria"]
  datasource_df = datasource_df.select(F.col('host')).where(F.col('id') == account_id)
  datasource_df.show()

But when I try to get the host value for each id, I don't get any rows back.

You should put the where clause before the select clause; otherwise the query always returns nothing, because once you select only host, the id column referenced in the where clause no longer exists.

datasource_df = datasource_df.where(F.col('id') == account_id).select(F.col('host'))

Also, for this type of query it's better to do a join instead of collecting one dataframe and filtering the other row by row.

You can use a semi-join:

datasource_df.join(criteria_df, on=datasource_df['id'] == criteria_df['account_id'], how='left_semi')\
.select(F.col('host'))

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you repost, please cite this site or the original source. For any questions contact: yoyou2525@163.com.
