I am trying to get a result set from the data below (a sample is shown). I want the distinct set of Name and Department values. I could find several answers about getting a distinct count, but none that matches my scenario, or maybe I was just not able to find one.
employeeDF = sqlContext.createDataFrame(
    [('1235', 'Hary', 'IT', 'U'),
     ('879', 'Jack', 'PTA', 'R'),
     ('32569', 'Hary', 'IT', 'T'),
     ('4598', 'MiKe', 'HR', 'Y')],
    ['ID', 'Name', 'Department', 'Tag'])
employeeDF.show()
employeeDF.createOrReplaceTempView("employee")
+-----+----+----------+---+
|   ID|Name|Department|Tag|
+-----+----+----------+---+
| 1235|Hary|        IT|  U|
|  879|Jack|       PTA|  R|
|32569|Hary|        IT|  T|
| 4598|MiKe|        HR|  Y|
+-----+----+----------+---+
So in the dataset you can see that Hary has 2 different IDs. I consider those rows junk data and don't want them in my result set. What I am trying to achieve is a set without duplicate IDs.
My expected output is:
+----+----+----------+
|  ID|Name|Department|
+----+----+----------+
| 879|Jack|       PTA|
|4598|MiKe|        HR|
+----+----+----------+
By running the query below I am able to get the following set by grouping on Name and Department, but I also need the ID. I cannot add ID to the group by, since I am trying to get the distinct set of Name and Department that maps to exactly one unique ID.
df = sqlContext.sql("select Name,Department from employee group by Name,Department having (count(distinct ID) =1 )")
df.show()
+----+----------+
|Name|Department|
+----+----------+
|Jack|       PTA|
|MiKe|        HR|
+----+----------+
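For reference, one direction I can think of is to join that grouped result back to the employee table to recover the ID (a rough sketch, untested):
# Sketch: join the (Name, Department) groups with exactly one distinct ID
# back to the employee table to pick up the matching ID
df = sqlContext.sql("""
    select e.ID, e.Name, e.Department
    from employee e
    join (
        select Name, Department
        from employee
        group by Name, Department
        having count(distinct ID) = 1
    ) g
    on e.Name = g.Name and e.Department = g.Department
""")
df.show()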
Update 1: Here I have duplicated Mike's entry to make sure collect_set removes the duplicate ID and that the count is taken with respect to Name and Department.
employeeDF = sqlContext.createDataFrame(
    [('1235', 'Hary', 'IT'),
     ('879', 'Jack', 'PTA'),
     ('32569', 'Hary', 'IT'),
     ('4598', 'MiKe', 'HR'),
     ('4598', 'MiKe', 'HR')],  # MiKe's entry duplicated on purpose
    ['ID', 'Name', 'Department'])
employeeDF.show()
employeeDF.createOrReplaceTempView("employee")
from pyspark.sql import functions as F, Window

# count of distinct IDs per (Name, Department), shown here before any filtering
result = employeeDF.withColumn(
    'count_id',
    F.size(F.collect_set('ID').over(Window.partitionBy('Name', 'Department')))
)
result.show()
+-----+----+----------+--------+
|   ID|Name|Department|count_id|
+-----+----+----------+--------+
|  879|Jack|       PTA|       1|
| 4598|MiKe|        HR|       1|
| 4598|MiKe|        HR|       1|
|32569|Hary|        IT|       2|
| 1235|Hary|        IT|       2|
+-----+----+----------+--------+
MiKe's entry is a duplicate row with the same ID, so when grouping by (MiKe, HR) I need them as one entry with count 1, since the IDs are the same within that Name and Department.
Expected result:
+-----+----+----------+--------+
|   ID|Name|Department|count_id|
+-----+----+----------+--------+
|  879|Jack|       PTA|       1|
| 4598|MiKe|        HR|       1|
|32569|Hary|        IT|       2|
| 1235|Hary|        IT|       2|
+-----+----+----------+--------+
You can calculate the distinct count of ID for each Name and Department using size(collect_set()):
from pyspark.sql import functions as F, Window

result = employeeDF.withColumn(
    'count_id',
    F.size(F.collect_set('ID').over(Window.partitionBy('Name', 'Department')))
).filter('count_id = 1').drop('count_id').distinct()
result.show()
+----+----+----------+
|  ID|Name|Department|
+----+----+----------+
| 879|Jack|       PTA|
|4598|MiKe|        HR|
+----+----+----------+
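The trailing .distinct() is what collapses MiKe's duplicated row, which otherwise survives the filter twice. As an alternative sketch (not the answer's exact approach), the same result can be reached without a window function, using groupBy with countDistinct and a join back:
from pyspark.sql import functions as F

# (Name, Department) groups that map to exactly one distinct ID
unique_groups = (employeeDF
    .groupBy('Name', 'Department')
    .agg(F.countDistinct('ID').alias('count_id'))
    .filter('count_id = 1')
    .drop('count_id'))

# keep only rows from those groups; dropDuplicates removes MiKe's repeated row
result = (employeeDF
    .join(unique_groups, on=['Name', 'Department'], how='inner')
    .dropDuplicates())
result.show()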
You can use not exists in Spark SQL:
df = sqlContext.sql("""
select *
from employee e1
where not exists (
select 1
from employee e2
where e1.Department = e2.Department
and e1.Name = e2.Name and e1.ID != e2.ID
)
""")
df.show()
#+----+----+----------+---+
#|  ID|Name|Department|Tag|
#+----+----+----------+---+
#|4598|MiKe|        HR|  Y|
#| 879|Jack|       PTA|  R|
#+----+----+----------+---+