Apache Spark and Python lambda

Question

I have the following code:

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

I copied this example from http://spark.apache.org/examples.html

I am unable to understand this code, especially the keywords

flatMap,
map and
reduceByKey

Can someone please explain in plain English what's going on?

Answer 1

map is the easiest: it applies the given function to every element of the sequence and returns the resulting sequence, with exactly one output element per input element (similar to foreach, except that the results are kept). flatMap is the same thing, but instead of returning just one element per input element, the function returns a sequence (which can be empty), and all of those sequences are flattened into a single one. Here's an answer explaining the difference between map and flatMap. Lastly, reduceByKey takes an aggregate function (meaning it takes two arguments of the same type and returns that type; it should also be commutative and associative, otherwise you will get inconsistent results), which is used to aggregate every V for each K in your sequence of (K, V) pairs.
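To make the map versus flatMap difference concrete, here is a minimal sketch in plain Python, with no Spark involved. The sample `lines` list is made up for illustration, and `itertools.chain.from_iterable` is only a stand-in for what flatMap does, not Spark's implementation. Note that `split()` with no argument is used so that the empty line yields an empty list; `split(" ")` would return `['']` instead.

```python
from itertools import chain

# Made-up sample input: three "lines", one of them empty.
lines = ["to be or", "not to be", ""]

# map: exactly one output element per input element (here, one list per line).
mapped = list(map(lambda line: line.split(), lines))

# flatMap stand-in: each input yields a sequence (possibly empty), and the
# sequences are concatenated into one flat sequence.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)       # [['to', 'be', 'or'], ['not', 'to', 'be'], []]
print(flat_mapped)  # ['to', 'be', 'or', 'not', 'to', 'be']
```

With map you still have three elements (one per line); with the flattened version you have six words and the empty line has disappeared.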

Example*:
from functools import reduce  # needed in Python 3; reduce is a builtin in Python 2
reduce(lambda a, b: a + b, [1, 2, 3, 4])

This says: aggregate the whole list with +, so it will do

1 + 2 = 3  
3 + 3 = 6  
6 + 4 = 10  
final result is 10

reduceByKey is the same thing, except that you do a separate reduce for each unique key.
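The "one reduce per unique key" idea can be sketched in plain Python as follows. The helper name `reduce_by_key` and the sample `pairs` are hypothetical and only for illustration; this is not how Spark implements reduceByKey, just the same result on a small in-memory list.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(func, pairs):
    """Plain-Python stand-in: group values by key, then reduce each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Made-up (key, value) pairs.
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
totals = reduce_by_key(lambda a, b: a + b, pairs)
print(totals)  # {'a': 9, 'b': 6}
```

For key "a" this does 1 + 3 = 4, then 4 + 5 = 9; for key "b" it does 2 + 4 = 6.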

So, to explain it in your example:

file = spark.textFile("hdfs://...")  # open the text file; each element of the RDD is one line of the file
counts = (
    file.flatMap(lambda line: line.split(" "))  # split returns a list of words per line; flatMap flattens those lists so every word is its own element
    .map(lambda word: (word, 1))  # map each word to a value of 1 so they can be summed
    .reduceByKey(lambda a, b: a + b)  # get an RDD of the count of every unique word by adding up all the 1s from the previous step
)
counts.saveAsTextFile("hdfs://...")  # save the result to HDFS
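If it helps to see the intermediate values, here is a rough trace of the same three stages in plain Python on a tiny made-up input. It runs without Spark; the variable names `words`, `pairs`, and `counts` are illustrative, and the loop at the end only mimics what reduceByKey computes with `a + b`.

```python
from itertools import chain

# Made-up stand-in for the lines of the text file.
lines = ["the cat sat", "the dog sat"]

# Stage 1 (flatMap): one flat list of words.
words = list(chain.from_iterable(line.split(" ") for line in lines))
print(words)  # ['the', 'cat', 'sat', 'the', 'dog', 'sat']

# Stage 2 (map): pair every word with the value 1.
pairs = [(word, 1) for word in words]
print(pairs[:3])  # [('the', 1), ('cat', 1), ('sat', 1)]

# Stage 3 (reduceByKey with a + b): add up the 1s for each distinct word.
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one
print(counts)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```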

So why count words this way? The reason is that the MapReduce programming paradigm is highly parallelizable, and thus scales to doing this computation on terabytes or even petabytes of data.

* I don't use Python much, so tell me if I made a mistake.

Answer 2

See the inline comments:

file = spark.textFile("hdfs://...")  # opens a file
counts = (
    file.flatMap(lambda line: line.split(" "))  # iterate over the lines, split each line by space (into words)
    .map(lambda word: (word, 1))  # for each word, create the tuple (word, 1)
    .reduceByKey(lambda a, b: a + b)  # go over the tuples "by key" (first element) and sum the second elements
)
counts.saveAsTextFile("hdfs://...")

A more detailed explanation of reduceByKey can be found here

Answer 3

The answers here are accurate at the code level, but it may help to understand what goes on under the hood.

My understanding is that when a reduce operation is called, there is a massive data shuffle, so that all KV pairs produced by the map() operation that share the same key are assigned to the same task, which sums the values in that collection of KV pairs. These tasks are assigned to different physical processors, and the results are then collated with another data shuffle.

So if the map operation produces (cat, 1) (cat, 1) (dog, 1) (cat, 1) (cat, 1) (dog, 1)

the reduce operation produces (cat, 4) (dog, 2)
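The cat/dog example can be simulated in plain Python to show roughly how the work could be spread out. This is only a sketch under assumptions: the two-way split in `partitions` and the helper `combine` are hypothetical, and the local pre-combine step reflects the Spark documentation's description of reduceByKey merging values on each mapper before the shuffle, not a reading of Spark's internals.

```python
def combine(pairs, func):
    """Merge the values of pairs that share a key, using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def add(a, b):
    return a + b


# Hypothetical split of the mapped pairs across two partitions.
partitions = [
    [("cat", 1), ("cat", 1), ("dog", 1)],
    [("cat", 1), ("cat", 1), ("dog", 1)],
]

# Step 1: each partition combines its own pairs locally.
local = [combine(part, add) for part in partitions]
print(local)  # [{'cat': 2, 'dog': 1}, {'cat': 2, 'dog': 1}]

# Step 2 ("shuffle"): partial results are gathered so equal keys meet.
shuffled = [(key, value) for part in local for key, value in part.items()]

# Step 3: a final reduce per key over the partial results.
result = combine(shuffled, add)
print(result)  # {'cat': 4, 'dog': 2}

# Same answer as reducing everything in one place, because + is
# associative and commutative.
all_pairs = [pair for part in partitions for pair in part]
print(result == combine(all_pairs, add))  # True
```

The last check is the reason the aggregate function should be associative and commutative: the partial results can then be computed in any grouping and order without changing the final counts.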

Hope this helps

3 answers

solution1
13 ACCEPTED 2014-07-04 15:59:47

solution2
4 2014-07-04 13:58:19

solution3
1 2014-09-13 23:32:22