I'm working with Spark 2.1.1 and Scala 2.11.8.
I'm executing my code in spark-shell. This is the code I'm executing:
val read_file1 = sc.textFile("Path to file 1");
val uid = read_file1.map(line => line.split(",")).map(array => array.map(arr => {
  if(arr.contains(":")) (array(2).split(":")(0), arr.split(":")(0))
  else (array(2).split(":")(0), arr)}))
val rdd1 = uid.map(array => array.drop(4)).flatMap(array => array.toSeq).map(y=>(y,1)).reduceByKey(_+_)
The output of this code is:
(( v67430612_serv78i, fb_201906266952256),1)
(( v74005958_serv35i, fb_128431994336303),1)
However, when I join the two RDDs by executing:
uid2.map(x => ((x._1, x._2), x._3)).join(rdd1).map(y => ((y._1._1, y._1._2, y._2._1), y._2._2))
I get the error:
"java.lang.UnsupportedOperationException: empty collection"
Why am I getting this error?
Here are samples of the input files:
File 1 :
2017-05-09 21:52:42 , 1494391962 , p69465323_serv80i:10:450 , 7 , fb_406423006398063:396560, guest_861067032060185_android:671051, fb_100000829486587:186589, fb_100007900293502:407374, fb_172395756592775:649795
2017-05-09 21:52:42 , 1494391962 , z67265107_serv77i:4:45 , 2:Re , fb_106996523208498:110066, fb_274049626104849:86632, fb_111857069377742:69348, fb_127277511127344:46246
File 2 :
fb_100008724660685,302502,-450,v300430479_serv73i:10:450,switchtable,2017-04-30 00:00:00
fb_190306964768414,147785,-6580,r308423810_serv31i::20,invite,2017-04-30 00:00:00
I also noticed that when I execute
rdd1.take(10).foreach(println) or rdd1.first()
I get this warning before the output:
WARN Executor: Managed memory leak detected; size = 39979424 bytes, TID = 11
I don't know whether this has anything to do with the problem.
Another note: the error only occurs when I call
res.first()
where res is
uid2.map(x => ((x._1, x._2), x._3)).join(rdd1).map(y => ((y._1._1, y._1._2, y._2._1), y._2._2))
Calling
res.take(10).foreach(println)
produces no output, but no error is returned either.
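As an aside, the same asymmetry holds for ordinary Scala collections, which may explain the behavior: take on an empty sequence quietly returns an empty sequence, while asking for the first element throws. A minimal plain-Scala sketch (no Spark needed):

```scala
// take(n) on an empty collection silently returns an empty collection.
val empty = Seq.empty[Int]
println(empty.take(10)) // List() -- no output rows, no error

// Asking for the first element of an empty collection throws instead,
// analogous to RDD.first() raising "empty collection" on an empty RDD.
val threw = try { empty.head; false } catch { case _: NoSuchElementException => true }
println(threw) // true
```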
You forgot to trim the spaces in the tuples created from the split lines, so nothing was joined because the keys didn't match. When you then tried to take the first element of the resulting empty RDD, the exception was thrown.

You can use the following solution; it works for me:
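To see why the untrimmed keys never match, here is a minimal plain-Scala sketch (no Spark needed) using a line from File 1: splitting on "," alone leaves a leading space in every field after the first, and that space breaks key equality in the join.

```scala
// A line shaped like File 1, with spaces around the commas.
val line = "2017-05-09 21:52:42 , 1494391962 , p69465323_serv80i:10:450"
val fields = line.split(",")

// Without trim, the extracted key keeps a leading space: " p69465323_serv80i"
val untrimmedKey = fields(2).split(":")(0)
// With trim, the key is clean: "p69465323_serv80i"
val trimmedKey = fields(2).split(":")(0).trim

println(untrimmedKey == "p69465323_serv80i") // false -- the space breaks the match
println(trimmedKey == "p69465323_serv80i")   // true
```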
// Build ((id, fbId), count) pairs from File 1, trimming whitespace from every key.
val read_file1 = sc.textFile("Path to file 1")
val uid = read_file1.map(line => line.split(",")).map(array => array.map(arr => {
  if(arr.contains(":")) (array(2).split(":")(0).trim, arr.split(":")(0).trim)
  else (array(2).split(":")(0).trim, arr.trim)}))
val rdd1 = uid.map(array => array.drop(4)).flatMap(array => array.toSeq).map(y => (y, 1)).reduceByKey(_ + _)

// Build (id, fbId, amount) triples from File 2, again trimming the keys.
val read_file2 = sc.textFile("Path to File 2")
val uid2 = read_file2.map(line => {val arr = line.split(","); (arr(3).split(":")(0).trim, arr(0).trim, arr(2).trim)})

// Join on the composite (id, fbId) key; with both sides trimmed, the keys now match.
val res = uid2.map(x => ((x._1, x._2), x._3)).join(rdd1).map(y => ((y._1._1, y._1._2, y._2._1), y._2._2))
res.take(10).foreach(println)
You get an empty collection after the join when there are no corresponding keys in the RDDs. Either the keys weren't trimmed, were sliced incorrectly, or there simply were no matches at all. I suggest checking whether there are matching keys in your files/RDDs, whether the data was extracted correctly, and whether you really need an inner join rather than a left or right outer join.
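The difference between the join variants can be sketched with plain Scala collections (no Spark needed); Spark's RDD join and leftOuterJoin follow the same semantics:

```scala
// Two keyed datasets: "b" only exists on the left, "c" only on the right.
val left  = Seq(("a", 1), ("b", 2))
val right = Seq(("a", 10), ("c", 30))
val rightMap = right.toMap

// Inner join: only keys present on BOTH sides survive.
val inner = left.collect { case (k, v) if rightMap.contains(k) => (k, (v, rightMap(k))) }

// Left outer join: every left key survives; missing right values become None.
val leftOuter = left.map { case (k, v) => (k, (v, rightMap.get(k))) }

println(inner)     // List((a,(1,10)))
println(leftOuter) // List((a,(1,Some(10))), (b,(2,None)))
```

If every key on one side fails to match (as with the untrimmed keys above), the inner join's result is empty, while an outer join would still preserve one side's rows.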