
pyspark: what is the best way to select n distinct IDs from a dataset


    id manager score
    A     x     3
    A     y     1
    B     a     2
    B     b     5
    C     f     2
    D     f     6

What I expect is exactly n IDs in the resulting dataset.

For example, if I say I need 3 IDs, the resulting dataset would look like this:

id manager score
 A   x      3
 A   y      1
 B   a      2
 B   b      5
 C   f      2

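So this dataset has 5 rows but, as expected, exactly 3 IDs. If I do df.limit(3), I only get the first 3 records, and because the IDs repeat I end up with fewer than 3 IDs (only 2 in this case). How should I go about this? Thanks.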
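To make the problem concrete, here is a small plain-Python sketch (no Spark involved) that mimics a row limit on the sample data and contrasts it with the desired result. The helper names `distinct_ids` and `rows_for_first_n_ids` are made up for illustration, and "first n IDs" is assumed to mean the n smallest IDs in sorted order.

```python
# Rows from the question as (id, manager, score) tuples.
rows = [
    ("A", "x", 3),
    ("A", "y", 1),
    ("B", "a", 2),
    ("B", "b", 5),
    ("C", "f", 2),
    ("D", "f", 6),
]


def distinct_ids(data):
    """Return the set of distinct ids in a list of rows."""
    return {row[0] for row in data}


def rows_for_first_n_ids(data, n):
    """Keep every row whose id is among the n smallest distinct ids."""
    wanted = set(sorted(distinct_ids(data))[:n])
    return [row for row in data if row[0] in wanted]


first_three_rows = rows[:3]                   # what a plain row limit keeps
wanted_rows = rows_for_first_n_ids(rows, 3)   # what the question asks for

print(sorted(distinct_ids(first_three_rows)))               # ['A', 'B']
print(len(wanted_rows), sorted(distinct_ids(wanted_rows)))  # 5 ['A', 'B', 'C']
```

The row limit stops after three rows and therefore sees only two IDs, while limiting the distinct IDs first and then filtering keeps all five matching rows.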

You can do this with a Spark SQL query.


df = spark.createDataFrame([("A", "x", "3"), ("A", "y", "1"), ("B", "a", "2"), ("B", "b", "5"), ("C", "v", "2"), ("D", "f", "6")], ["id", "manager", "score"])
df.createOrReplaceTempView("employee")  # register the view that the query below reads
df.show()

| id|manager|score|
|  A|      x|    3|
|  A|      y|    1|
|  B|      a|    2|
|  B|      b|    5|
|  C|      v|    2|
|  D|      f|    6|


spark.sql("""select * from employee e1
     inner join (
        select distinct id as uni_id from employee order by uni_id limit 3) e2
     on e1.id=e2.uni_id""").drop("uni_id").show()

| id|manager|score|
|  A|      x|    3|
|  A|      y|    1|
|  B|      a|    2|
|  B|      b|    5|
|  C|      v|    2|


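import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer

n = 3  # change as per limit
idx = StringIndexer(inputCol="id", outputCol="id_num")
# Assumed completion (the original snippet is cut off here): index the ids, keep indices below n
idx.fit(df).transform(df).filter(F.col("id_num") < n).drop("id_num").show()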

| id|manager|score|
|  A|      x|    3|
|  A|      y|    1|
|  B|      a|    2|
|  B|      b|    5|
|  C|      f|    2|
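The StringIndexer snippet above works by turning each id into a numeric index. As far as I understand, the default ordering is by descending frequency, with ties broken alphabetically in recent Spark versions, so the ids you end up with are the most frequent ones rather than the alphabetically first ones. The plain-Python sketch below imitates that ordering under this assumption; `frequency_desc_index` is a made-up helper, not a Spark API.

```python
from collections import Counter


def frequency_desc_index(values):
    """Map each label to an index: most frequent first, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


ids = ["A", "A", "B", "B", "C", "D"]   # the id column from the question
index = frequency_desc_index(ids)

n = 3
kept = [i for i in ids if index[i] < n]   # keep ids whose index is below n

print(index)  # {'A': 0.0, 'B': 1.0, 'C': 2.0, 'D': 3.0}
print(kept)   # ['A', 'A', 'B', 'B', 'C']
```

On the sample data the two orderings agree, but with data where, say, D occurred most often, D would receive index 0 and be selected first.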

Here is a simple way using the 'collect_set' function and some Pythonic operations:

import pyspark.sql.functions as F

idLimit = 3  # define your limit

id_lst = (sourceDF  # collect a list of distinct ids
          .select(F.collect_set("id"))  # assumed completion; the original snippet is cut off here
          .first()[0])

id_lst.sort()  # sort the ids alphabetically

id_lst_limited = id_lst[:idLimit]  # limit the list as per your defined limit

targetDF = (sourceDF  # filter the source df using your limited list
            .filter("id in ({0})".format(str(id_lst_limited)[1:-1])))
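The filter string in that last step is built with a small trick that is easy to miss, so here is what it evaluates to in plain Python. The starting list is a hypothetical stand-in for whatever the distinct-id collection returns, and the variable names are mine.

```python
id_lst = ["C", "A", "D", "B"]   # hypothetical result of collecting the distinct ids
id_limit = 3

id_lst.sort()                        # ['A', 'B', 'C', 'D']
id_lst_limited = id_lst[:id_limit]   # ['A', 'B', 'C']

# str() of a list looks like "['A', 'B', 'C']"; slicing with [1:-1] strips the brackets.
condition = "id in ({0})".format(str(id_lst_limited)[1:-1])

print(condition)  # id in ('A', 'B', 'C')
```

This is fine for simple ids like the ones here, but ids containing quote characters may not survive the string formatting. Passing the list to `F.col("id").isin(id_lst_limited)` inside the filter avoids building the SQL string by hand.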

You can avoid the join simply by using where id in (select distinct id ... limit 3), as shown below:

    val df = Seq(("A", "x", "3"), ("A", "y", "1"), ("B", "a", "2"), ("B", "b", "5"), ("C", "v", "2"), ("D", "f", "6"))
      .toDF("id", "manager", "score")
    df.show(false)
    df.createOrReplaceTempView("employee") // register the view queried below

      * +---+-------+-----+
      * |id |manager|score|
      * +---+-------+-----+
      * |A  |x      |3    |
      * |A  |y      |1    |
      * |B  |a      |2    |
      * |B  |b      |5    |
      * |C  |v      |2    |
      * |D  |f      |6    |
      * +---+-------+-----+

    spark.sql("select * from employee where id in (select distinct id from employee order by id limit 3)")
      .show(false)

      * +---+-------+-----+
      * |id |manager|score|
      * +---+-------+-----+
      * |A  |x      |3    |
      * |A  |y      |1    |
      * |B  |a      |2    |
      * |B  |b      |5    |
      * |C  |v      |2    |
      * +---+-------+-----+

I noticed that one of the answers above is based on Spark SQL.

Here is another Spark SQL based approach, this time with a WINDOW clause:

spark.sql("select id, manager, score from (select e1.id, e1.manager, e1.score, dense_rank() over (order by e1.id) as rrank from employee e1) where rrank <= 3").show()

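Full code:

df = spark.createDataFrame([("A", "x", "3"), ("A", "y", "1"), ("B", "a", "2"), ("B", "b", "5"), ("C", "v", "2"), ("D", "f", "6")], ["id", "manager", "score"])
df.createOrReplaceTempView("employee")  # register the view that the query below reads
df.show()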

| id|manager|score|
|  A|      x|    3|
|  A|      y|    1|
|  B|      a|    2|
|  B|      b|    5|
|  C|      v|    2|
|  D|      f|    6|


spark.sql("select id, manager, score from (select e1.id, e1.manager, e1.score, dense_rank() over (order by e1.id) as rrank from employee e1) where rrank <= 3").show()

| id|manager|score|
|  A|      x|    3|
|  A|      y|    1|
|  B|      a|    2|
|  B|      b|    5|
|  C|      v|    2|
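For anyone unsure what dense_rank() contributes in the query above: it numbers the distinct ids 1, 2, 3, ... in sorted order without gaps, and every row carrying the same id gets the same number. A rough plain-Python equivalent, with a made-up helper name and no Spark involved, might look like this:

```python
# Sample rows as (id, manager, score), matching the DataFrame in this answer.
rows = [
    ("A", "x", "3"),
    ("A", "y", "1"),
    ("B", "a", "2"),
    ("B", "b", "5"),
    ("C", "v", "2"),
    ("D", "f", "6"),
]


def dense_rank_by_id(data):
    """Assign 1, 2, 3, ... to the distinct ids in sorted order, with no gaps."""
    unique_ids = sorted({row[0] for row in data})
    return {id_: rank for rank, id_ in enumerate(unique_ids, start=1)}


ranks = dense_rank_by_id(rows)

n = 3
selected = [row for row in rows if ranks[row[0]] <= n]   # same idea as "where rrank <= 3"

print(ranks)          # {'A': 1, 'B': 2, 'C': 3, 'D': 4}
print(len(selected))  # 5
```

One thing to keep in mind with the SQL version: a window with an order by but no partition by makes Spark move all rows into a single partition, which can be slow on large datasets.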

