
How to join two RDDs : value join is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]

I'm using Spark 2.1.0 and Scala 2.10.6

When I try to do this :

val x = (avroRow1).join(flattened)

I get the error :

value join is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]

Why do I get this message? I have the following import statements :

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.PairRDDFunctions

import org.apache.spark.sql._
import com.databricks.spark.avro._
import org.apache.spark.sql.functions.map
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.col

This is my code :

val avroRow = spark.read.avro(inputString).rdd
val avroParsed = avroRow
  .map(x => new TRParser(x))
  .map((obj: TRParser) => {
    val tId = obj.source.trim
    var retVal: String = ""
    obj.ids.foreach((obj: TRUEntry) => {
      retVal += tId + "," + obj.uId.trim + ":"
    })
    retVal.dropRight(1)
  })

val flattened = avroParsed
  .flatMap(x => x.split(":"))
  .map(y => (y, 1))
  .reduceByKey(_ + _)
  .map { case (a, b) =>
    val Array(first, second) = a.split(",")
    ((first, second), b)
  }
  .saveAsTextFile(outputString)


val avroRow1 = spark.read.avro(inputString1).rdd
val avroParsed1 = avroRow1
  .map(x => new TLParser(x))
  .map((obj: TLParser) => (obj.source, obj.uid, obj.chmon))
  .map { case (a, b, c) => ((a, b), c) }
  .saveAsTextFile(outputString1)


    val x = (avroParsed1).join(flattened)

UPDATE

This is my sample output for avroRow1 :

((p872301075,fb_100004351878884),37500)
((p162506011,fb_100006956538970),-200000)

This is my sample output for flattened :

((p872301075,fb_100004351878884),2)
((p162506011,fb_100006956538970),1)

This is the output I'm trying to get :

((p872301075,fb_100004351878884),37500,2)
((p162506011,fb_100006956538970),-200000,1)

The join() operation is only available on pair RDDs, i.e. an RDD[(K, V)], via PairRDDFunctions. In your case the RDDs are not pair RDDs. The reason is that joining two RDDs requires a common key, and a generic RDD has none. Convert your avroRow1 and flattened into (key, value) form and then perform the join.

From the Spark documentation: a Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. The RDD class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join.
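As a minimal sketch of that implicit conversion (the sample keys and values below are invented, shaped like the outputs in the question, and assume an existing SparkContext named sc):

```scala
// Both sides are RDD[((String, String), Int)] -- pair RDDs, so
// PairRDDFunctions (and therefore join) become available implicitly.
val left  = sc.parallelize(Seq((("p872301075", "fb_100004351878884"), 37500)))
val right = sc.parallelize(Seq((("p872301075", "fb_100004351878884"), 2)))

// join matches rows on the (String, String) key; the two values
// are combined into a tuple, giving RDD[((String, String), (Int, Int))].
val joined = left.join(right)
```

A plain RDD[Row] or RDD[String] has no such key, which is exactly why the compiler reports that join is not a member.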

val avroRow1 = spark.read.avro(inputString1).rdd

Here you are converting a DataFrame to an RDD. You should convert avroRow1 into (key, value) pairs and then apply the join operation.

.saveAsTextFile(outputString) causes the problem because it is an action that returns Unit, so assigning its result changes the type of the variable: flattened and avroParsed1 end up as Unit instead of RDDs. Rather than saving to individual files before joining, the RDDs can be persist()-ed and only the final output saved with .saveAsTextFile, in this manner:

val flattened = avroParsed
  .flatMap(x => x.split(":"))
  .map(y => (y, 1))
  .reduceByKey(_ + _)
  .map { case (a, b) =>
    val Array(first, second) = a.split(",")
    ((first, second), b)
  }


val avroRow1 = spark.read.avro(inputString1).rdd
val avroParsed1 = avroRow1
  .map(x => new TLParser(x))
  .map((obj: TLParser) => (obj.source, obj.uid, obj.chmon))
  .map { case (a, b, c) => ((a, b), c) }


val res = avroParsed1.join(flattened)
res.saveAsTextFile(outputString1)
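Note that the join produces values of the shape ((source, uid), (chmon, count)), e.g. ((p872301075,fb_100004351878884),(37500,2)). If the flat shape shown in the question's desired output is needed, one extra map flattens the nested tuple before saving (a sketch; the field names source, uid, chmon, count are assumed from the parsers above):

```scala
// Flatten ((source, uid), (chmon, count)) into ((source, uid), chmon, count)
// to match the desired output ((p872301075,fb_100004351878884),37500,2).
val res = avroParsed1.join(flattened)
  .map { case ((source, uid), (chmon, count)) => ((source, uid), chmon, count) }
res.saveAsTextFile(outputString1)
```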
