Keep groups where at least one element satisfies condition in pyspark

I've been trying to reproduce in pyspark something that is fairly easy to do in Pandas, but I've been struggling for a while now. Say I have the following dataframe:

import pandas as pd

df = pd.DataFrame({'a':[1,2,2,1,1,2], 'b':[12,5,1,19,2,7]})
print(df)
   a   b
0  1  12
1  2   5
2  2   1
3  1  19
4  1   2
5  2   7

And the list

l = [5,1]

What I'm trying to do is to group by a, and if any of the elements in b are in the list, return True for all values in the group. Then we could use the result to index the dataframe. The Pandas equivalent would be:

df[df.b.isin(l).groupby(df.a).transform('any')]

   a  b
1  2  5
2  2  1
5  2  7

Reproducible dataframe in pyspark:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({'a':[1,2,2,1,1,2], 'b':[12,5,1,19,2,7]})
sparkdf = spark.createDataFrame(df)

I was going in the direction of grouping by a and applying a pandas UDF, though there's surely a better way to do this using Spark only.
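
For reference, a minimal sketch of that pandas UDF direction, assuming Spark 3.0+ for applyInPandas (the function name keep_if_any_in_l is just illustrative):

import pandas as pd

def keep_if_any_in_l(pdf: pd.DataFrame) -> pd.DataFrame:
    # Return the whole group if any value of b is in l, otherwise an empty frame
    return pdf if pdf['b'].isin(l).any() else pdf.iloc[0:0]

# The output schema matches the input, so sparkdf.schema can be reused
sparkdf.groupby('a').applyInPandas(keep_if_any_in_l, schema=sparkdf.schema).show()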

I've figured out a simple enough solution. The first step is to use isin and filter to keep only the rows where the values in b are in the list, and then select the distinct grouping keys (a).

Then, by joining that result back with the dataframe on a, we keep only the groups that contain at least one value from the list:

from pyspark.sql import functions as f

unique_a = (sparkdf.filter(f.col('b').isin(l))
                   .select('a').distinct())
sparkdf.join(unique_a, 'a').show()

+---+---+
|  a|  b|
+---+---+
|  2|  5|
|  2|  1|
|  2|  7|
+---+---+
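
Alternatively, a window function mirrors the Pandas transform('any') more directly and avoids the join; a sketch under the same l and sparkdf as above (the column name any_in_l is just a scratch name):

from pyspark.sql import functions as f
from pyspark.sql.window import Window

w = Window.partitionBy('a')

(sparkdf
 .withColumn('any_in_l', f.max(f.col('b').isin(l).cast('int')).over(w))  # 1 if any b in the group is in l
 .filter(f.col('any_in_l') == 1)  # keep only those groups
 .drop('any_in_l')
 .show())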
