
How does the result change when using .distinct() in Spark?

I was working with an Apache log file. I created an RDD of (day, host) tuples, one from each log line. The next step was to group by host and then display the result.

I applied distinct() after mapping the first RDD into (day, host) tuples. When I don't use distinct() I get a different result than when I do. So how does the result change when using distinct() in Spark?
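A minimal sketch of the mapping step described in the question, using plain Python lists as a stand-in for the RDD (the sample log lines and the to_day_host helper are hypothetical, and the real log format may differ; in PySpark the same function would be passed to rdd.map):

```python
import re

# Hypothetical Apache common-log lines; the questioner's actual format may differ.
log_lines = [
    '127.0.0.1 - - [10/Oct/2014:13:55:36 -0700] "GET / HTTP/1.1" 200 2326',
    '10.0.0.5 - - [10/Oct/2014:14:02:11 -0700] "GET /a HTTP/1.1" 200 100',
    '127.0.0.1 - - [10/Oct/2014:13:59:59 -0700] "GET /b HTTP/1.1" 404 50',
]

def to_day_host(line):
    """Extract a (day, host) tuple from one log line (hypothetical format)."""
    host = line.split()[0]                           # client host is the first field
    day = re.search(r'\[(\d+)/', line).group(1)      # day-of-month from the timestamp
    return (day, host)

pairs = [to_day_host(line) for line in log_lines]
# PySpark equivalent: pairs_rdd = sc.textFile(path).map(to_day_host)
print(pairs)  # [('10', '127.0.0.1'), ('10', '10.0.0.5'), ('10', '127.0.0.1')]
```

Note that the same (day, host) pair appears twice here, which is exactly the situation where distinct() changes the result.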

distinct() removes duplicate entries from the RDD. Your count should decrease or stay the same after applying distinct().

http://spark.apache.org/docs/0.7.3/api/pyspark/pyspark.rdd.RDD-class.html#distinct

I think when you only use the map transformation on FIRST_RDD (the logs), you get SECOND_RDD, and the count of SECOND_RDD will equal the count of FIRST_RDD, because map produces exactly one output element per input element.
But if you apply distinct() to SECOND_RDD, the count decreases to the number of distinct tuples present in SECOND_RDD.
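The point above can be illustrated with plain Python, using a list and a set as stand-ins for the RDD and its distinct() result (the sample pairs are hypothetical):

```python
# Stand-in for SECOND_RDD: (day, host) pairs with one exact duplicate.
second_rdd = [('10', '127.0.0.1'), ('10', '10.0.0.5'), ('10', '127.0.0.1')]

# map() is one-to-one, so the mapped RDD has the same count as its source:
# in PySpark, logs.map(to_day_host).count() == logs.count()
assert len(second_rdd) == 3

# distinct() drops exact duplicates, so the count can only shrink or stay equal:
# in PySpark, second_rdd.distinct().count() <= second_rdd.count()
distinct_pairs = set(second_rdd)
print(len(distinct_pairs))  # 2 -- the duplicate ('10', '127.0.0.1') is counted once
```

This is why grouping by host gives different results with and without distinct(): without it, a host that appears many times on the same day contributes one tuple per log line; with it, each (day, host) combination contributes exactly one.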
