
How to select first n row items based on multiple conditions in PySpark

Now I have data like this:

+----+----+
|col1|   d|
+----+----+
|   A|   4|
|   A|  10|
|   A|   3|
|   B|   3|
|   B|   6|
|   B|   4|
|   B| 5.5|
|   B|  13|
+----+----+

col1 is StringType and d is TimestampType; here I use DoubleType instead for simplicity. I want to select data based on a list of condition tuples. Given the tuples [(A,3.5),(A,8),(B,3.5),(B,10)], I want to get a result like:

+----+---+
|col1|  d|
+----+---+
|   A|  4|
|   A| 10|
|   B|  4|
|   B| 13|
+----+---+

That is, for each element in the tuple, we select from the PySpark dataframe the first row where d is larger than the tuple's number and col1 equals the tuple's string. What I've already written is:

df_res = spark_empty_dataframe
for (x, y) in tuples:
    dft = df.filter(df.col1 == x).filter(df.d > y).limit(1)
    df_res = df_res.union(dft)

But I think this might have an efficiency problem; I do not know whether I am right.

A possible approach that avoids the loop is to create a dataframe from the tuples you have as input:

from pyspark.sql import functions as F

t = [('A', 3.5), ('A', 8), ('B', 3.5), ('B', 10)]
ref = spark.createDataFrame([(i[0], float(i[1])) for i in t], ("col1_y", "d_y"))
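
For reference, ref.show() on the dataframe built above would produce something like:

+------+----+
|col1_y| d_y|
+------+----+
|     A| 3.5|
|     A| 8.0|
|     B| 3.5|
|     B|10.0|
+------+----+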

Then we can join it with the input dataframe (df) on the condition, group by the tuple keys and values (which repeat for every matching row) to get the first value in each group, and finally drop the extra columns:

(df.join(ref, (df.col1 == ref.col1_y) & (df.d > ref.d_y), how='inner')
   .orderBy("col1", "d")
   .groupBy("col1_y", "d_y")
   .agg(F.first("col1").alias("col1"), F.first("d").alias("d"))
   .drop("col1_y", "d_y")
   .show())

+----+----+
|col1|   d|
+----+----+
|   A|10.0|
|   A| 4.0|
|   B| 4.0|
|   B|13.0|
+----+----+

Note: if the order of the dataframe is important, you can assign an index column with monotonically_increasing_id, include it in the aggregation, and then orderBy that index column.
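
A minimal sketch of that order-preserving variant (assuming the same df, ref, and F as above; the idx column name is only for illustration):

df_idx = df.withColumn("idx", F.monotonically_increasing_id())

(df_idx.join(ref, (df_idx.col1 == ref.col1_y) & (df_idx.d > ref.d_y), how='inner')
       .orderBy("col1", "d")
       .groupBy("col1_y", "d_y")
       .agg(F.first("col1").alias("col1"),
            F.first("d").alias("d"),
            F.first("idx").alias("idx"))   # carry the original row index through the aggregation
       .orderBy("idx")                     # restore the original row order
       .drop("col1_y", "d_y", "idx")
       .show())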

EDIT: another way is to skip the ordering and get the first value directly with min:

(df.join(ref, (df.col1 == ref.col1_y) & (df.d > ref.d_y), how='inner')
   .groupBy("col1_y", "d_y")
   .agg(F.min("col1").alias("col1"), F.min("d").alias("d"))
   .drop("col1_y", "d_y")
   .show())

+----+----+
|col1|   d|
+----+----+
|   B| 4.0|
|   B|13.0|
|   A| 4.0|
|   A|10.0|
+----+----+
