
Group by and filter a Pyspark data frame

I have a PySpark data frame with 3 columns. Some rows are similar in 2 columns but not in the third one, see the example below.

----------------------------------------
first_name | last_name | requests_ID    |
----------------------------------------
Joe        | Smith     |[2,3]           |
---------------------------------------- 
Joe        | Smith     |[2,3,5,6]       |
---------------------------------------- 
Jim        | Bush      |[9,7]           |
---------------------------------------- 
Jim        | Bush      |[21]            |
---------------------------------------- 
Sarah      | Wood      |[2,3]           |
----------------------------------------   

I want to group the rows by the {first_name, last_name} columns and keep only the row with the largest {requests_ID} array. So the results should be:

----------------------------------------
first_name | last_name | requests_ID    |
----------------------------------------
Joe        | Smith     |[2,3,5,6]       |
---------------------------------------- 
Jim        | Bush      |[9,7]           |
---------------------------------------- 
Sarah      | Wood      |[2,3]           |
---------------------------------------- 

I have tried different things like the following, but it gives me a nested array of both rows in the group-by instead of the longest one.

gr_df = filtered_df.groupBy("first_name", "last_name").agg(F.collect_set("requests_ID").alias("requests_ID")) 

Here are the results I get:

---------------------------------------------
first_name | last_name | requests_ID        |
---------------------------------------------
Joe        | Smith     | [[2,3],[2,3,5,6]]  |
---------------------------------------------
Jim        | Bush      | [[9,7],[21]]       |
---------------------------------------------
Sarah      | Wood      | [[2,3]]            |
---------------------------------------------

To follow through with your current df that looks like this,

---------------------------------------------
first_name | last_name | requests_ID        |
---------------------------------------------
Joe        | Smith     | [[2,3],[2,3,5,6]]  |
---------------------------------------------
Jim        | Bush      | [[9,7],[21]]       |
---------------------------------------------
Sarah      | Wood      | [[2,3]]            |
---------------------------------------------

try this,

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, ArrayType

def myfunc(x):
  # record the length of each nested list
  temp = []
  for sub in x:
    temp.append(len(sub))

  # index of the longest nested list
  max_ind = temp.index(max(temp))

  return x[max_ind]

udf_extract = F.udf(myfunc, ArrayType(IntegerType()))

df = df.withColumn('new_requests_ID', udf_extract('requests_ID'))

#df.show()

or alternatively, without a separate variable declaration,

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, ArrayType

@F.udf(returnType=ArrayType(IntegerType()))
def myfunc(x):
  # record the length of each nested list
  temp = []
  for sub in x:
    temp.append(len(sub))

  # index of the longest nested list
  max_ind = temp.index(max(temp))

  return x[max_ind]

df = df.withColumn('new_requests_ID', myfunc('requests_ID'))

#df.show()
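
For completeness, here is a minimal end-to-end sketch, assuming filtered_df holds the original three columns from the question and udf_extract is the UDF defined above:

import pyspark.sql.functions as F

# group as in the question, producing a nested array per (first_name, last_name)
gr_df = filtered_df.groupBy("first_name", "last_name") \
    .agg(F.collect_set("requests_ID").alias("requests_ID"))

# keep only the longest nested array in each group
result = gr_df.withColumn("requests_ID", udf_extract("requests_ID"))
result.show(truncate=False)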

You can use size to determine the length of the array column and then use a window like below:

Imports and create sample DataFrame

import pyspark.sql.functions as f
from pyspark.sql.window import Window

df = spark.createDataFrame([('Joe', 'Smith', [2, 3]),
                            ('Joe', 'Smith', [2, 3, 5, 6]),
                            ('Jim', 'Bush', [9, 7]),
                            ('Jim', 'Bush', [21]),
                            ('Sarah', 'Wood', [2, 3])],
                           ('first_name', 'last_name', 'requests_ID'))

Define a window to assign row numbers based on the length of the requests_ID column, in descending order.

Here, f.size("requests_ID") gives the length of the requests_ID column and desc() sorts it in descending order.

w_spec = Window().partitionBy("first_name", "last_name").orderBy(f.size("requests_ID").desc())

Apply the window function and keep the first row of each group.

df.withColumn("rn", f.row_number().over(w_spec)).where("rn ==1").drop("rn").show()
+----------+---------+------------+
|first_name|last_name| requests_ID|
+----------+---------+------------+
|       Jim|     Bush|      [9, 7]|
|     Sarah|     Wood|      [2, 3]|
|       Joe|    Smith|[2, 3, 5, 6]|
+----------+---------+------------+
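
Note that row_number assigns a single row number 1 per (first_name, last_name) group, so the rn == 1 filter always keeps exactly one row; if two rows in a group tie on array length, which one survives is arbitrary.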
