I have a DataFrame that I've created using the following code:
val SomeCsv = spark.read
  .option("header", "true")
  .csv(conf.getString("data.path.Somecsv"))
  .toDF()
I have a function (which does nothing so far) that looks like this:
def cleanUp(data: sql.DataFrame): sql.DataFrame = {
  data.map({ doc =>
    doc
  })
}
which breaks on compilation with the error:
"Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._"
I have the import statement set up as other posts have suggested:
val spark = SparkSession.builder...etc
import spark.implicits._
The import statement is flagged as unused by IntelliJ.
My guess is that:

1. the CSV loading code is using some encoder that is an object rather than primitives, and/or
2. I need to specify the datatypes of the DataFrame in my function signature, the way you do with RDDs (a sketch of what I mean follows). I couldn't find any information on this in the Spark documentation.
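To make guess 2 concrete, this is roughly what I mean. It's an untested sketch on my part; I'm assuming Spark 2.x's RowEncoder (from org.apache.spark.sql.catalyst.encoders) can be supplied explicitly like this:

import org.apache.spark.sql.{DataFrame, Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

def cleanUp(data: DataFrame): DataFrame = {
  /* supply an explicit Encoder[Row] instead of relying on spark.implicits._ */
  implicit val rowEnc: Encoder[Row] = RowEncoder(data.schema)
  data.map(doc => doc)
}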
EDIT
If I instead use
val SomeOtherCsv = SomeCsv.map(t => t(0) + "foobar")
the import statement triggers and everything compiles nicely. My issue now is that the method version (above) on the same data still breaks.
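As far as I can tell, the difference is the element type: the working line maps each row to a String (and spark.implicits._ does provide an Encoder[String]), while the method maps Row to Row:

import org.apache.spark.sql.{DataFrame, Dataset}

/* compiles: the result element type is String, which spark.implicits._ can encode */
val SomeOtherCsv: Dataset[String] = SomeCsv.map(t => t(0) + "foobar")

/* does not compile: doc is a Row, and spark.implicits._ has no Encoder[Row] */
def cleanUp(data: DataFrame): DataFrame = data.map(doc => doc)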
EDIT2
Here is the MCVE:
import org.apache.spark._
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._ /* statement unused */
import com.typesafe.config.ConfigFactory

object main {
  def main(args: Array[String]) = {
    /* load spark conf */
    val sparkConf = new SparkConf().setAppName("main")
    val sc = new SparkContext(sparkConf)

    /* load configure tool */
    val conf = ConfigFactory.load()

    /* load spark session */
    val spark = SparkSession.builder
      .master("local")
      .appName("tester")
      .getOrCreate()
    import spark.implicits._ /* is used for val ProcessedGenomeCsv but not for testFunctionOne */

    /* load genome csv as a dataframe; conf.getString points to application.conf, which contains a local directory for the csv file */
    val GenomeCsv = spark.read
      .option("header", "true")
      .csv(conf.getString("data.path.genomecsv"))
      .toDF()

    /* cleans up segment names in the csv so they can be matched to amino data */
    def testFunctionOne(data: sql.DataFrame): sql.DataFrame = { /* breaks with the "unable to find encoder" error; the error points to the next line, "data.map" */
      data.map({ doc =>
        doc
      })
    }

    val ProcessedGenomeCsv = GenomeCsv.map(t => t(12) + "foobar") /* breaks when adding sqlContext and sqlContext.implicits._, is fine otherwise */
    val FunctionProcessedGenomCsv = testFunctionOne(GenomeCsv)

    ProcessedGenomeCsv.take(1).foreach(println)
    FunctionProcessedGenomCsv.take(1).foreach(println)
  }
}
You want sqlContext.implicits._
You want to declare it after you create the sqlContext (which is already created for you in spark-shell, but not in spark-submit)
You want it to look like this:
object Driver {
  def main(args: Array[String]): Unit = {
    val spark_conf =
      new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .setAppName("Spark Tika HDFS")
    val sc = new SparkContext(spark_conf)

    /* create the sqlContext first, then import its implicits */
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val df = ....
  }
}
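For Spark 2.x with SparkSession (as in your MCVE), the same ordering applies: import from the session value after it exists. A minimal, self-contained sketch (the data here is made up for illustration):

import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("tester")
      .getOrCreate()
    import spark.implicits._ /* declared after `spark` exists */

    /* toDF and the String encoder below both come from spark.implicits._ */
    val df = Seq(("chr1", "geneA"), ("chr2", "geneB")).toDF("segment", "gene")
    val processed = df.map(r => r.getString(0) + "foobar") /* Dataset[String] */
    processed.show()
  }
}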