Scastie renders compiler error: "value countByValue is not a member of org.apache.spark.sql.Dataset[String]"
Hi, I'm trying to build a histogram of movie ratings using Scastie. Here's the implementation.
The sbt settings in Scastie:
scalacOptions ++= Seq(
"-deprecation",
"-encoding", "UTF-8",
"-feature",
"-unchecked"
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.3",
"org.apache.spark" %% "spark-sql" % "2.4.3"
)
The actual code in Scastie:
import org.apache.spark.sql.SparkSession
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._

object TestApp extends App {
  lazy implicit val spark =
    SparkSession.builder().master("local").appName("spark_test").getOrCreate()

  import spark.implicits._ // Required to call the .toDF function later

  val html = scala.io.Source.fromURL("http://files.grouplens.org/datasets/movielens/ml-100k/u.data").mkString // Get all rows as one string

  val seqOfRecords = html.split("\n") // Split based on the newline characters
    .filter(_ != "")                  // Filter out any empty lines
    .toSeq                            // Convert to Seq so we can convert to DF later
    .map(row => row.split("\t"))
    .map { case Array(f1, f2, f3, f4) => (f1, f2, f3, f4) }

  val df = seqOfRecords.toDF("col1", "col2", "col3", "col4")

  val ratings = df.map(x => x.toString().split("\t")(2))

  // Count up how many times each value (rating) occurs
  val results = ratings.countByValue()

  // Sort the resulting map of (rating, count) tuples
  val sortedResults = results.toSeq.sortBy(_._1)

  // Print each result on its own line.
  sortedResults.foreach(println)

  spark.close()
}
The error when running this in Scastie:

value countByValue is not a member of org.apache.spark.sql.Dataset[String]
==========================================

Edit: the modified code now gives a different error in Scastie:
java.lang.ExceptionInInitializerError
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:136)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:745)
at org.apache.spark.sql.Dataset.show(Dataset.scala:704)
at org.apache.spark.sql.Dataset.show(Dataset.scala:713)
at TestApp$.delayedEndpoint$TestApp$1(main.scala:22)
at TestApp$delayedInit$body.apply(main.scala:4)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at TestApp$.main(main.scala:4)
at TestApp.main(main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sbt.Run.invokeMain(Run.scala:115)
at sbt.Run.execute$1(Run.scala:79)
at sbt.Run.$anonfun$runWithLoader$4(Run.scala:92)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at sbt.util.InterfaceUtil$$anon$1.get(InterfaceUtil.scala:10)
at sbt.TrapExit$App.run(TrapExit.scala:257)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.9.8
at com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
at com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:751)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
... 40 more
Here is the updated code in Scastie:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object TestApp extends App {
  lazy implicit val spark = SparkSession.builder().master("local").appName("spark_test").getOrCreate()

  import spark.implicits._ // Required to call the .toDF function later

  val html = scala.io.Source.fromURL("http://files.grouplens.org/datasets/movielens/ml-100k/u.data").mkString // Get all rows as one string

  val seqOfRecords = html.split("\n") // Split based on the newline characters
    .filter(_ != "")                  // Filter out any empty lines
    .toSeq                            // Convert to Seq so we can convert to DF later
    .map(row => row.split("\t"))      // Split each line on the tab character, giving an Array of 4 Strings each
    .map { case Array(f1, f2, f3, f4) => (f1, f2, f3, f4) } // Convert each Array[String] into a (String, String, String, String) tuple

  val df = seqOfRecords.toDF("col1", "col2", "col3", "col4") // Give whatever column names you want

  df.select("col3").groupBy("col3").count.sort(col("count").desc).show()

  spark.close() // Don't forget to close(); otherwise Scastie won't let you create another session so soon.
}
By the time you reach the ratings variable, you're working with a Spark structure called a Dataset. You can look at the documentation describing what it can and cannot do. It doesn't have a method called countByValue, which is why you see the error you're seeing.
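For illustration, countByValue does exist on RDDs, so ratings.rdd.countByValue() would compile. Its semantics can be sketched without Spark using plain Scala collections (a minimal sketch; the sample values below are made up):

```scala
// What RDD.countByValue() computes: a Map from each distinct value
// to its number of occurrences. Same idea on a plain Seq:
val ratings = Seq("3", "3", "1", "5", "3", "1") // hypothetical sample ratings

val counts: Map[String, Long] =
  ratings.groupBy(identity).map { case (value, occurrences) => value -> occurrences.size.toLong }

// Sort for deterministic output, like the sortBy(_._1) in the original code:
counts.toSeq.sortBy(_._1).foreach(println)
// (1,2)
// (3,3)
// (5,1)
```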
Everything you have makes sense until you reach this line:

val ratings = df.map(x => x.toString().split("\t")(2))

This currently produces an error.
If you go back to the df variable, you have a table that looks like this:
+----+----+----+---------+
|col1|col2|col3| col4|
+----+----+----+---------+
| 196| 242| 3|881250949|
| 186| 302| 3|891717742|
| 22| 377| 1|878887116|
| 244| 51| 2|880606923|
| 166| 346| 1|886397596|
+----+----+----+---------+
You can run the command df.show() to see a sample of what's in the Dataset. From there, I think what you want looks something like groupBy. Have a look at some examples of that to see where to go next.
First part of your question: the main problem in your code is trying to split on the tab character \t. Your records don't contain any tabs, as I explained in the comments.
The problem is that when you map over df, you're accessing each org.apache.spark.sql.Row object; for example, df.first is [196,242,3,881250949]. You can convert it to a String, but there's no \t (tab) to split on, so split simply returns the String as-is in a one-element Array[String], and accessing the second element throws a java.lang.ArrayIndexOutOfBoundsException.
Here's a demonstration:
// We get the first row and brute force convert it toString()
df.head.toString
//res21: String = [196,242,3,881250949] <- See? No tab anywhere
df.head.toString.split("\t")
//res22: Array[String] = Array([196,242,3,881250949]) <- Returns the string as is in an Array
res22(0)
//res24: String = [196,242,3,881250949] <- First Element
res22(1)
//java.lang.ArrayIndexOutOfBoundsException: 1 <- No second (or third) element found, hence the "out of bounds" exception.
// ... 55 elided
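The failure mode above is plain String behavior, so it can be reproduced without Spark (a minimal sketch; the string literal below just mimics the Row rendering shown above):

```scala
// A Row rendered via toString uses commas, not tabs, so splitting on "\t"
// can never produce more than one element.
val rendered = "[196,242,3,881250949]"

val parts = rendered.split("\t")
println(parts.length) // prints 1 -> the whole string comes back as the single element
// parts(1) would throw java.lang.ArrayIndexOutOfBoundsException

// If you really had to parse the rendered string, splitting on "," reaches the rating:
val fields = rendered.stripPrefix("[").stripSuffix("]").split(",")
println(fields(2)) // prints 3
```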
I understand from your comments that you're trying to get the third column. The nice thing about using a DataFrame is that you can simply select the columns you want by name. You can then groupBy that column (which returns a RelationalGroupedDataset) and aggregate with the count method.
import org.apache.spark.sql.functions.col
df.select("col3").groupBy("col3").count.sort(col("count").desc).show()
//+----+-----+
//|col3|count|
//+----+-----+
//| 4|34174|
//| 3|27145|
//| 5|21201|
//| 2|11370|
//| 1| 6110|
//+----+-----+
Second part of your question: it seems Scastie loads a newer version of com.fasterxml.jackson.core:jackson-databind than the one Spark 2.4.3 uses. Scastie appears to pull in version 2.9.8 (as shown in the stack trace), while Spark 2.4.3 uses the older 2.6.7.
The only way I got it to work was by using newer versions of Spark and Scala. Spark 3.0.1 uses Jackson 2.10.0.
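An alternative that sometimes resolves this class of transitive-dependency conflict (untested on Scastie, so treat it as an assumption) would be to pin jackson-databind to the version Spark expects via sbt's dependencyOverrides, instead of upgrading Spark:

```scala
// Hypothetical sbt setting: force the Jackson version Spark 2.4.3 was built against.
// dependencyOverrides takes precedence over transitively-pulled newer versions.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.7"
```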
Under Build settings:

Set Scala Version to 2.12.10.

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.1",
  "org.apache.spark" %% "spark-sql" % "3.0.1"
)
It doesn't perform well; the browser hangs and sometimes times out. I guess Scastie hasn't been optimized for this version yet.

Edit: actually, after I muted the logging, it works much better now!

Still... you really should install Spark on your local machine.