Key/Value pair RDD

Question

I have a question about key/value pair RDDs.

I have five files in the C:/download/input folder. Each file contains the dialogue from one film:

movie_horror_Conjuring.txt
movie_comedy_eurotrip.txt
movie_horror_insidious.txt
movie_sci-fi_Interstellar.txt
movie_horror_evildead.txt

I am reading the files in the input folder with sc.wholeTextFiles(), which gives me key/value pairs like this:

(C:/download/input/movie_horror_Conjuring.txt,values)

I want to group the input files of each genre together using groupByKey(), so that the values of all the horror movies end up together, all the comedy movies together, and so on.

Is there any way to generate the key/value pairs as (horror, values) instead of (C:/download/input/movie_horror_Conjuring.txt,values)?

val ipfile = sc.wholeTextFiles("C:/download/input")
val output = ipfile.groupByKey().map(t => (t._1,t._2))

The above code gives me the following output:

(C:/download/input/movie_horror_Conjuring.txt,values)
(C:/download/input/movie_comedy_eurotrip.txt,values)
(C:/download/input/movie_horror_insidious.txt,values)
(C:/download/input/movie_sci-fi_Interstellar.txt,values)
(C:/download/input/movie_horror_evildead.txt,values)

whereas I need the output to be:

(horror, (values1, values2, values3))
(comedy, (values1))
(sci-fi, (values1))

I also tried some map and split operations to strip the folder path from the key and keep only the file name, but I was not able to keep the corresponding values attached to the new keys.

I would also like to know how to get the line count of values1, values2, values3, and so on.

My final output should look like this:

(horror, 100)

where 100 is the sum of the line counts of values1 (40 lines), values2 (30 lines), and values3 (30 lines), and likewise for the other genres.

Answer 1

Try this:

val output = ipfile.map { case (k, v) => (k.split("_")(1), v) }.groupByKey()
output.collect

Let me know if this works for you!
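
One thing to watch: k.split("_")(1) is applied to the whole path, so it only returns the genre when no directory name in the path contains an underscore. Below is a minimal sketch of a more defensive variant. It assumes the file names always follow the movie_<genre>_<title>.txt pattern and that the keys use forward slashes, as in the question. The helper name genreOf and the sample paths are hypothetical, introduced only for illustration.

```scala
// Hypothetical helper (not part of the original answer): pull the genre out of
// a key such as "C:/download/input/movie_horror_Conjuring.txt".
def genreOf(path: String): String = {
  // Keep only the file name, so underscores in directory names are ignored.
  // lastIndexOf returns -1 when there is no slash, which keeps the whole string.
  val fileName = path.substring(path.lastIndexOf('/') + 1)
  // Assumption: file names look like movie_<genre>_<title>.txt.
  fileName.split("_")(1)
}

// Made-up sample paths; the second one has an underscore in a directory name.
val paths = Seq(
  "C:/download/input/movie_horror_Conjuring.txt",
  "C:/my_files/input/movie_comedy_eurotrip.txt",
  "C:/download/input/movie_sci-fi_Interstellar.txt"
)

val genres = paths.map(genreOf)
```

In the Spark job the helper would take the place of the inline split, for example ipfile.map { case (k, v) => (genreOf(k), v) }.groupByKey().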

Update:

To get the output in the form (horror, 100):

val output = ipfile.map { case (k, v) => (k.split("_")(1), v.count(_ == '\n')) }.reduceByKey(_ + _)
output.collect
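
Note that v.count(_ == '\n') counts newline characters, so a file whose last line is not terminated by a newline is undercounted by one. If that matters for your data, a small helper along these lines may be closer to what you want. This is only a sketch: lineCount is a hypothetical name, and it assumes lines are separated by '\n' (a "\r\n" ending still contains one '\n' per line, so it is counted the same way).

```scala
// Hypothetical helper (not part of the original answer): count the lines in
// the full text of one file, as returned in the value of wholeTextFiles.
def lineCount(text: String): Int = {
  val newlines = text.count(_ == '\n')
  // Add one for a final line that is not terminated by a newline.
  if (text.nonEmpty && !text.endsWith("\n")) newlines + 1 else newlines
}

// Made-up sample strings to show the difference from a plain newline count.
val terminated   = lineCount("line one\nline two\n") // 2 lines, 2 newlines
val unterminated = lineCount("line one\nline two")   // 2 lines, 1 newline
val empty        = lineCount("")                     // no lines at all
```

To use it in the job, swap it in for the inline count: ipfile.map { case (k, v) => (k.split("_")(1), lineCount(v)) }.reduceByKey(_ + _).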

solution1
1 2016-09-22 13:08:36