I am trying to get a result set from the data below (a sample is shown). I want the distinct set of Name and Department values. I could find several answers about getting a distinct count, but none that matches my scenario, or maybe I was just not able to find one.
employeeDF = sqlContext.createDataFrame(
    [('1235', 'Hary', 'IT', 'U'),
     ('879', 'Jack', 'PTA', 'R'),
     ('32569', 'Hary', 'IT', 'T'),
     ('4598', 'MiKe', 'HR', 'Y')],
    ['ID', 'Name', 'Department', 'Tag'])
employeeDF.show()
employeeDF.createOrReplaceTempView("employee")
+-----+----+----------+---+
|   ID|Name|Department|Tag|
+-----+----+----------+---+
| 1235|Hary|        IT|  U|
|  879|Jack|       PTA|  R|
|32569|Hary|        IT|  T|
| 4598|MiKe|        HR|  Y|
+-----+----+----------+---+
So in the dataset you can see that Hary has 2 different IDs. I consider those rows junk data and don't want them in my result set. What I am trying to achieve is a set without duplicate IDs.
My expected output is:
+----+----+----------+
|  ID|Name|Department|
+----+----+----------+
| 879|Jack|       PTA|
|4598|MiKe|        HR|
+----+----+----------+
By running the query below I am able to get the following set by grouping on Name and Department, but I also need the ID. I cannot add ID to the group by, since I am trying to get the distinct set of Name and Department that maps to exactly one unique ID.
df = sqlContext.sql("select Name,Department from employee group by Name,Department having (count(distinct ID) =1 )")
df.show()
+----+----------+
|Name|Department|
+----+----------+
|Jack|       PTA|
|MiKe|        HR|
+----+----------+
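For reference, one direction I can think of is to join that grouped result back to the employee table to recover the ID (a rough sketch, untested):
# Sketch: join the (Name, Department) groups with exactly one distinct ID
# back to the employee table to pick up the matching ID
df = sqlContext.sql("""
    select e.ID, e.Name, e.Department
    from employee e
    join (
        select Name, Department
        from employee
        group by Name, Department
        having count(distinct ID) = 1
    ) g
    on e.Name = g.Name and e.Department = g.Department
""")
df.show()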
Update 1: Here I have duplicated Mike's entry to make sure collect_set removes the duplicate ID and that the count is taken with respect to Name and Department.
employeeDF = sqlContext.createDataFrame(
    [('1235', 'Hary', 'IT'),
     ('879', 'Jack', 'PTA'),
     ('32569', 'Hary', 'IT'),
     ('4598', 'MiKe', 'HR'),
     ('4598', 'MiKe', 'HR')],  # MiKe's entry duplicated on purpose
    ['ID', 'Name', 'Department'])
employeeDF.show()
employeeDF.createOrReplaceTempView("employee")
from pyspark.sql import functions as F, Window

# count of distinct IDs per (Name, Department), shown here before any filtering
result = employeeDF.withColumn(
    'count_id',
    F.size(F.collect_set('ID').over(Window.partitionBy('Name', 'Department')))
)
result.show()
+-----+----+----------+--------+
|   ID|Name|Department|count_id|
+-----+----+----------+--------+
|  879|Jack|       PTA|       1|
| 4598|MiKe|        HR|       1|
| 4598|MiKe|        HR|       1|
|32569|Hary|        IT|       2|
| 1235|Hary|        IT|       2|
+-----+----+----------+--------+
MiKe's entry is a duplicate row with the same ID, so when grouping by (MiKe, HR) I need them as one entry with count 1, since the IDs are the same within that Name and Department.
Expected result:
+-----+----+----------+--------+
|   ID|Name|Department|count_id|
+-----+----+----------+--------+
|  879|Jack|       PTA|       1|
| 4598|MiKe|        HR|       1|
|32569|Hary|        IT|       2|
| 1235|Hary|        IT|       2|
+-----+----+----------+--------+
You can calculate the distinct count of ID for each Name and Department using size(collect_set()):
from pyspark.sql import functions as F, Window

result = employeeDF.withColumn(
    'count_id',
    F.size(F.collect_set('ID').over(Window.partitionBy('Name', 'Department')))
).filter('count_id = 1').drop('count_id').distinct()
result.show()
+----+----+----------+
|  ID|Name|Department|
+----+----+----------+
| 879|Jack|       PTA|
|4598|MiKe|        HR|
+----+----+----------+
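The trailing .distinct() is what collapses MiKe's duplicated row, which otherwise survives the filter twice. As an alternative sketch (not the answer's exact approach), the same result can be reached without a window function, using groupBy with countDistinct and a join back:
from pyspark.sql import functions as F

# (Name, Department) groups that map to exactly one distinct ID
unique_groups = (employeeDF
    .groupBy('Name', 'Department')
    .agg(F.countDistinct('ID').alias('count_id'))
    .filter('count_id = 1')
    .drop('count_id'))

# keep only rows from those groups; dropDuplicates removes MiKe's repeated row
result = (employeeDF
    .join(unique_groups, on=['Name', 'Department'], how='inner')
    .dropDuplicates())
result.show()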
You can use not exists in Spark SQL:
df = sqlContext.sql("""
select *
from employee e1
where not exists (
select 1
from employee e2
where e1.Department = e2.Department
and e1.Name = e2.Name and e1.ID != e2.ID
)
""")
df.show()
#+----+----+----------+---+
#|  ID|Name|Department|Tag|
#+----+----+----------+---+
#|4598|MiKe|        HR|  Y|
#| 879|Jack|       PTA|  R|
#+----+----+----------+---+