
Pyspark: Get the amount of distinct combinations between two columns

I need to be able to get the number of distinct combinations across two separate columns.

In this example, using the "Animal" and "Color" columns, the result I want is 3, since three distinct combinations of the columns occur. Basically, Animal or Color can be the same across separate rows, but if two rows have the same Animal AND Color, the duplicate should be omitted from this count.

Animal | Color
Dog    | Brown
Dog    | White
Cat    | Black
Dog    | White

I know you can add data to a set and that will eliminate duplicates, but I couldn't seem to get it to work with multiple variables.
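
(For what it's worth, the set idea does work in plain Python when each row's two values are stored together as a single tuple; a minimal sketch, independent of Spark:)

rows = [('Dog', 'Brown'), ('Dog', 'White'), ('Cat', 'Black'), ('Dog', 'White')]

# A set of (animal, color) tuples keeps only the distinct pairs
distinct_pairs = {(animal, color) for animal, color in rows}
print(len(distinct_pairs))  # 3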

Here is the example code I was given to attempt to solve this.

# Convert the DataFrame to an RDD of (year, number) key-value pairs
d = d.rdd
d = d.map(lambda row: (row.day.year, row.number))
print(d.take(2000))

# For each year, keep the largest number seen
d_maxNum = d.reduceByKey(lambda max_num, this_num: this_num if this_num > max_num else max_num)
print(d_maxNum.collect())
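
That snippet solves a different problem (a per-year maximum), but the same RDD pattern can be adapted to count distinct pairs; a sketch, assuming df is a DataFrame with the Animal and Color columns shown above:

# Map each row to an (Animal, Color) tuple, deduplicate, and count
pairs = df.rdd.map(lambda row: (row.Animal, row.Color))
print(pairs.distinct().count())  # 3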

PySpark has a dropDuplicates method that you can use:

from pyspark.sql import Row

df = sc.parallelize([
    Row(Animal='Dog', Color='White'),
    Row(Animal='Dog', Color='Black'),
    Row(Animal='Dog', Color='White'),
    Row(Animal='Cat', Color='White'),
]).toDF()

df.dropDuplicates(['Animal', 'Color']).count()

which gives 3 as output.
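
The same count can also be expressed as an aggregate with pyspark.sql.functions.countDistinct, which accepts multiple columns; for example:

from pyspark.sql.functions import countDistinct

# Counts distinct (Animal, Color) pairs in a single aggregation
df.select(countDistinct('Animal', 'Color')).show()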

You can use the distinct function.

# Perform distinct on the entire dataframe
df.distinct().show()

# Perform distinct on certain columns of the dataframe
df.select('Animal', 'Color').distinct().show()
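
Since the question asks for the number of distinct combinations rather than the rows themselves, chain count() onto the distinct selection:

df.select('Animal', 'Color').distinct().count()  # returns 3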
