
Filter on multiple columns in Spark Dataframe based API

I have a dataframe like:

+--------+-------+--------------------+-------------------+
|     id1|    id2|                body|         created_at|
+--------+-------+--------------------+-------------------+
|1       |      4|....................|2017-10-01 00:00:05|
|2       |      3|....................|2017-10-01 00:00:05|
|3       |      2|....................|2017-10-01 00:00:05|
|4       |      1|....................|2017-10-01 00:00:05|
+--------+-------+--------------------+-------------------+

I would like to filter the table on both id1 and id2. For example, get the rows where id1=1, id2=4 and the rows where id1=2, id2=3.

Currently, I'm using a loop to generate a giant query string for df.filter(), i.e. ((id1 = 1) and (id2 = 4)) or ((id1 = 2) and (id2 = 3)). Just wondering if there is a cleaner way to achieve this?
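For context, a minimal sketch of that string-building approach, assuming PySpark; the names pairs and df are illustrative:

# Build one big SQL condition string from a list of wanted (id1, id2) pairs
pairs = [(1, 4), (2, 3)]
condition = " or ".join(
    "((id1 = {}) and (id2 = {}))".format(a, b) for a, b in pairs
)
# df.filter() accepts a SQL expression string, so this works but scales poorly
result = df.filter(condition)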

You can generate a helper DF (table):

tmp:

+--------+-------+
|     id1|    id2|
+--------+-------+
|1       |      4|
|2       |      3|
+--------+-------+

and then join them:

SELECT a.*
FROM tab a
JOIN tmp b
  ON (a.id1 = b.id1 AND a.id2 = b.id2)

where tab is your original DF, registered as a table.
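For reference, a sketch of the full flow in PySpark, assuming an active SparkSession named spark and the original DataFrame df (both names are illustrative):

# Build the helper DF of wanted (id1, id2) pairs
tmp = spark.createDataFrame([(1, 4), (2, 3)], ["id1", "id2"])

# Register both DataFrames as temp views and run the join above
df.createOrReplaceTempView("tab")
tmp.createOrReplaceTempView("tmp")
result = spark.sql(
    "SELECT a.* FROM tab a JOIN tmp b ON a.id1 = b.id1 AND a.id2 = b.id2"
)

# Equivalent join through the DataFrame API, with no view registration needed
result = df.join(tmp, on=["id1", "id2"], how="inner")

The inner join keeps exactly those rows of tab whose (id1, id2) pair appears in tmp, which is the filtering the question asks for.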
