I have a DataFrame that I've created using the following code:
val SomeCsv = spark.read
  .option("header", "true")
  .csv(conf.getString("data.path.Somecsv"))
  .toDF()
I have a function (which does nothing so far) that looks like this:
def cleanUp(data: sql.DataFrame): sql.DataFrame = {
  data.map({ doc =>
    doc
  })
}
which breaks on compilation with the error:
"Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._"
I have the import statement set up as other posts have suggested:
val spark = SparkSession.builder...etc
import spark.implicits._
The import statement is flagged as unused by IntelliJ.
My guess is that:

1. the CSV loading code is using some encoder that is an object rather than primitives, and/or
2. I need to specify the datatypes of the DataFrame in my function signature, the way you do with RDDs (a sketch of what I mean follows). I couldn't find any information on this in the Spark documentation.
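To make guess 2 concrete, this is roughly what I mean. It's an untested sketch on my part; I'm assuming Spark 2.x's RowEncoder (from org.apache.spark.sql.catalyst.encoders) can be supplied explicitly like this:

import org.apache.spark.sql.{DataFrame, Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

def cleanUp(data: DataFrame): DataFrame = {
  /* supply an explicit Encoder[Row] instead of relying on spark.implicits._ */
  implicit val rowEnc: Encoder[Row] = RowEncoder(data.schema)
  data.map(doc => doc)
}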
EDIT
If I instead use
val SomeOtherCsv = SomeCsv.map(t => t(0) + "foobar")
the import statement triggers and everything compiles nicely. My issue now is that the method version (above) on the same data still breaks.
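As far as I can tell, the difference is the element type: the working line maps each row to a String (and spark.implicits._ does provide an Encoder[String]), while the method maps Row to Row:

import org.apache.spark.sql.{DataFrame, Dataset}

/* compiles: the result element type is String, which spark.implicits._ can encode */
val SomeOtherCsv: Dataset[String] = SomeCsv.map(t => t(0) + "foobar")

/* does not compile: doc is a Row, and spark.implicits._ has no Encoder[Row] */
def cleanUp(data: DataFrame): DataFrame = data.map(doc => doc)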
EDIT2
Here is the MCVE:
import org.apache.spark._
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._ /* statement unused */
import com.typesafe.config.ConfigFactory

object main {
  def main(args: Array[String]) = {
    /* load spark conf */
    val sparkConf = new SparkConf().setAppName("main")
    val sc = new SparkContext(sparkConf)

    /* load configure tool */
    val conf = ConfigFactory.load()

    /* load spark session */
    val spark = SparkSession.builder
      .master("local")
      .appName("tester")
      .getOrCreate()
    import spark.implicits._ /* is used for val ProcessedGenomeCsv but not for testFunctionOne */

    /* load genome csv as a dataframe; conf.getString points to application.conf, which contains a local directory for the csv file */
    val GenomeCsv = spark.read
      .option("header", "true")
      .csv(conf.getString("data.path.genomecsv"))
      .toDF()

    /* cleans up segment names in the csv so they can be matched to amino data */
    def testFunctionOne(data: sql.DataFrame): sql.DataFrame = { /* breaks with the "unable to find encoder" error; the error points to the next line, "data.map" */
      data.map({ doc =>
        doc
      })
    }

    val ProcessedGenomeCsv = GenomeCsv.map(t => t(12) + "foobar") /* breaks when adding sqlContext and sqlContext.implicits._, is fine otherwise */
    val FunctionProcessedGenomCsv = testFunctionOne(GenomeCsv)

    ProcessedGenomeCsv.take(1).foreach(println)
    FunctionProcessedGenomCsv.take(1).foreach(println)
  }
}
You want sqlContext.implicits._
You want to declare it after you create the sqlContext (which is already created for you in spark-shell, but not in spark-submit)
You want it to look like this:
object Driver {
  def main(args: Array[String]): Unit = {
    val spark_conf =
      new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .setAppName("Spark Tika HDFS")
    val sc = new SparkContext(spark_conf)

    /* create the sqlContext first, then import its implicits */
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val df = ....
  }
}
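For Spark 2.x with SparkSession (as in your MCVE), the same ordering applies: import from the session value after it exists. A minimal, self-contained sketch (the data here is made up for illustration):

import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("tester")
      .getOrCreate()
    import spark.implicits._ /* declared after `spark` exists */

    /* toDF and the String encoder below both come from spark.implicits._ */
    val df = Seq(("chr1", "geneA"), ("chr2", "geneB")).toDF("segment", "gene")
    val processed = df.map(r => r.getString(0) + "foobar") /* Dataset[String] */
    processed.show()
  }
}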