Filter on multiple columns in Spark DataFrame based API
I have a dataframe like:
+--------+-------+--------------------+-------------------+
|     id1|    id2|                body|         created_at|
+--------+-------+--------------------+-------------------+
|1       |      4|....................|2017-10-01 00:00:05|
|2       |      3|....................|2017-10-01 00:00:05|
|3       |      2|....................|2017-10-01 00:00:05|
|4       |      1|....................|2017-10-01 00:00:05|
+--------+-------+--------------------+-------------------+
I would like to filter the table using both id1 and id2. For example, get the rows where id1=1, id2=4 and the rows where id1=2, id2=3.
Currently, I'm using a loop to generate a giant query string for df.filter(), i.e. ((id1 = 1) and (id2 = 4)) or ((id1 = 2) and (id2 = 3)). Is there a more appropriate way to achieve this?
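For reference, the loop-built filter string described in the question might be produced by something like the sketch below. The helper name `build_filter` and the `pairs` list are assumptions for illustration; the question does not show the actual code.

```python
# Hypothetical helper: builds the OR-of-ANDs filter string from (id1, id2) pairs.
def build_filter(pairs):
    clauses = [f"((id1 = {a}) and (id2 = {b}))" for a, b in pairs]
    return " or ".join(clauses)

pairs = [(1, 4), (2, 3)]
query = build_filter(pairs)
# The string could then be passed to df.filter(query) on the Spark side.
```

With many pairs this string grows linearly, which is the situation the question is trying to avoid.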
You can generate a helper DF (table):
tmp:
+--------+-------+
|     id1|    id2|
+--------+-------+
|1       |      4|
|2       |      3|
+--------+-------+
and then join them:
SELECT a.*
FROM tab a
JOIN tmp b
ON (a.id1 = b.id1 and a.id2 = b.id2)
where tab is your original DF, registered as a table.