Keep groups where at least one element satisfies condition in pyspark

I've been trying to reproduce in pyspark something that is fairly easy to do in Pandas, but I've been struggling for a while now. Say I have the following dataframe:

import pandas as pd

df = pd.DataFrame({'a':[1,2,2,1,1,2], 'b':[12,5,1,19,2,7]})
print(df)
   a   b
0  1  12
1  2   5
2  2   1
3  1  19
4  1   2
5  2   7

And the list

l = [5,1]

What I'm trying to do is to group by a, and if any of the elements in b are in the list, return True for all values in the group. Then we could use the result to index the dataframe. The Pandas equivalent would be:

df[df.b.isin(l).groupby(df.a).transform('any')]

   a  b
1  2  5
2  2  1
5  2  7

Reproducible dataframe in pyspark:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({'a':[1,2,2,1,1,2], 'b':[12,5,1,19,2,7]})
sparkdf = spark.createDataFrame(df)

I was going in the direction of grouping by a and applying a pandas UDF, though there's surely a better way to do this using Spark only.
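
For reference, a minimal sketch of that pandas UDF direction, assuming Spark 3.0+ for applyInPandas (the function name keep_if_any_in_l is just illustrative):

import pandas as pd

def keep_if_any_in_l(pdf: pd.DataFrame) -> pd.DataFrame:
    # Return the whole group if any value of b is in l, otherwise an empty frame
    return pdf if pdf['b'].isin(l).any() else pdf.iloc[0:0]

# The output schema matches the input, so sparkdf.schema can be reused
sparkdf.groupby('a').applyInPandas(keep_if_any_in_l, schema=sparkdf.schema).show()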

I've figured out a simple enough solution. The first step is to use isin and filter to keep only the rows where the values in b are in the list, and then select the distinct grouping keys (a).

Then, by joining that result back with the dataframe on a, we keep only the groups that contain at least one value from the list:

from pyspark.sql import functions as f

unique_a = (sparkdf.filter(f.col('b').isin(l))
                   .select('a').distinct())
sparkdf.join(unique_a, 'a').show()

+---+---+
|  a|  b|
+---+---+
|  2|  5|
|  2|  1|
|  2|  7|
+---+---+
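
Alternatively, a window function mirrors the Pandas transform('any') more directly and avoids the join; a sketch under the same l and sparkdf as above (the column name any_in_l is just a scratch name):

from pyspark.sql import functions as f
from pyspark.sql.window import Window

w = Window.partitionBy('a')

(sparkdf
 .withColumn('any_in_l', f.max(f.col('b').isin(l).cast('int')).over(w))  # 1 if any b in the group is in l
 .filter(f.col('any_in_l') == 1)  # keep only those groups
 .drop('any_in_l')
 .show())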
