
How does the result change when using .distinct() in Spark?

I was working with an Apache log file. I created an RDD of (day, host) tuples, one from each log line. The next step was to group by host and then display the result.

I applied distinct() after mapping the first RDD into (day, host) tuples. When I don't use distinct() I get a different result than when I do. So how does the result change when using distinct() in Spark?
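A minimal sketch of the mapping step described in the question, using plain Python lists as a stand-in for the RDD (the sample log lines and the to_day_host helper are hypothetical, and the real log format may differ; in PySpark the same function would be passed to rdd.map):

```python
import re

# Hypothetical Apache common-log lines; the questioner's actual format may differ.
log_lines = [
    '127.0.0.1 - - [10/Oct/2014:13:55:36 -0700] "GET / HTTP/1.1" 200 2326',
    '10.0.0.5 - - [10/Oct/2014:14:02:11 -0700] "GET /a HTTP/1.1" 200 100',
    '127.0.0.1 - - [10/Oct/2014:13:59:59 -0700] "GET /b HTTP/1.1" 404 50',
]

def to_day_host(line):
    """Extract a (day, host) tuple from one log line (hypothetical format)."""
    host = line.split()[0]                           # client host is the first field
    day = re.search(r'\[(\d+)/', line).group(1)      # day-of-month from the timestamp
    return (day, host)

pairs = [to_day_host(line) for line in log_lines]
# PySpark equivalent: pairs_rdd = sc.textFile(path).map(to_day_host)
print(pairs)  # [('10', '127.0.0.1'), ('10', '10.0.0.5'), ('10', '127.0.0.1')]
```

Note that the same (day, host) pair appears twice here, which is exactly the situation where distinct() changes the result.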

distinct() removes duplicate entries from the RDD. Your count should decrease or stay the same after applying distinct().

http://spark.apache.org/docs/0.7.3/api/pyspark/pyspark.rdd.RDD-class.html#distinct

I think when you only use the map transformation on FIRST_RDD (the logs), you get SECOND_RDD, and the count of SECOND_RDD will equal the count of FIRST_RDD, because map produces exactly one output element per input element.
But if you apply distinct() to SECOND_RDD, the count decreases to the number of distinct tuples present in SECOND_RDD.
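The point above can be illustrated with plain Python, using a list and a set as stand-ins for the RDD and its distinct() result (the sample pairs are hypothetical):

```python
# Stand-in for SECOND_RDD: (day, host) pairs with one exact duplicate.
second_rdd = [('10', '127.0.0.1'), ('10', '10.0.0.5'), ('10', '127.0.0.1')]

# map() is one-to-one, so the mapped RDD has the same count as its source:
# in PySpark, logs.map(to_day_host).count() == logs.count()
assert len(second_rdd) == 3

# distinct() drops exact duplicates, so the count can only shrink or stay equal:
# in PySpark, second_rdd.distinct().count() <= second_rdd.count()
distinct_pairs = set(second_rdd)
print(len(distinct_pairs))  # 2 -- the duplicate ('10', '127.0.0.1') is counted once
```

This is why grouping by host gives different results with and without distinct(): without it, a host that appears many times on the same day contributes one tuple per log line; with it, each (day, host) combination contributes exactly one.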
