Failed to execute user defined function in Apache Spark using Scala
I have the following DataFrame:
+---------------+-----------+-------------+--------+--------+--------+--------+------+-----+
| time_stamp_0|sender_ip_1|receiver_ip_2|s_port_3|r_port_4|acknum_5|winnum_6| len_7|count|
+---------------+-----------+-------------+--------+--------+--------+--------+------+-----+
|06:36:16.293711| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58| 65161| 130|
|06:36:16.293729| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58| 65913| 130|
|06:36:16.293743| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|131073| 130|
|06:36:16.293765| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|196233| 130|
|06:36:16.293783| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|196985| 130|
|06:36:16.293798| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|262145| 130|
|06:36:16.293820| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|327305| 130|
|06:36:16.293837| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|328057| 130|
|06:36:16.293851| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|393217| 130|
|06:36:16.293873| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|458377| 130|
|06:36:16.293890| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|459129| 130|
|06:36:16.293904| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|524289| 130|
|06:36:16.293926| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|589449| 130|
|06:36:16.293942| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|590201| 130|
|06:36:16.293956| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|655361| 130|
|06:36:16.293977| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|720521| 130|
|06:36:16.293994| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|721273| 130|
|06:36:16.294007| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|786433| 130|
|06:36:16.294028| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|851593| 130|
|06:36:16.294045| 10.0.0.1| 10.0.0.2| 55518| 5001| 0| 58|852345| 130|
+---------------+-----------+-------------+--------+--------+--------+--------+------+-----+
only showing top 20 rows
I have to add features and a label to my DataFrame in order to predict the count value. But when I run the code, I get the following error:
Failed to execute user defined function(anonfun$15: (int, int, string, string, int, int, int, int, int) => vector)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
I also applied cast(IntegerType) to all of my features, but the error still occurs. Here is my code:
val Frist_Dataframe = sqlContext.createDataFrame(Row_Dstream_Train, customSchema)
val toVec9 = udf[Vector, Int, Int, String, String, Int, Int, Int, Int, Int] { (a, b, c, d, e, f, g, h, i) =>
  val e3 = c match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }
  val e4 = d match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }
  Vectors.dense(a, b, e3, e4, e, f, g, h, i)
}
val final_df = Dataframe.withColumn(
  "features",
  toVec9(
    // casting into Timestamp to parse the string, and then into Int
    $"time_stamp_0".cast(TimestampType).cast(IntegerType),
    $"count".cast(IntegerType),
    $"sender_ip_1",
    $"receiver_ip_2",
    $"s_port_3".cast(IntegerType),
    $"r_port_4".cast(IntegerType),
    $"acknum_5".cast(IntegerType),
    $"winnum_6".cast(IntegerType),
    $"len_7".cast(IntegerType)
  )
).withColumn("label", (Dataframe("count"))).select("features", "label")
final_df.show()
val trainingTest = final_df.randomSplit(Array(0.8, 0.2))
val TrainingDF = trainingTest(0).toDF()
val TestingDF=trainingTest(1).toDF()
TrainingDF.show()
TestingDF.show()
My dependencies are:
libraryDependencies ++= Seq(
"co.theasi" %% "plotly" % "0.2.0",
"org.apache.spark" %% "spark-core" % "2.1.1",
"org.apache.spark" %% "spark-sql" % "2.1.1",
"org.apache.spark" %% "spark-hive" % "2.1.1",
"org.apache.spark" %% "spark-streaming" % "2.1.1",
"org.apache.spark" %% "spark-mllib" % "2.1.1"
)
The most interesting point is that if I change every cast(IntegerType) in the last part of my code to cast(TimestampType).cast(IntegerType), the error disappears and the output looks like this:
+--------+-----+
|features|label|
+--------+-----+
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
+--------+-----+
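UPDATE: After applying @Ramesh Maharjan's solution, my DataFrame result looks fine. However, whenever I try to split my final_df DataFrame into training and testing sets, the result is as shown below, and I still have the same problem of rows with null features.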
+--------------------+-----+
| features|label|
+--------------------+-----+
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
| null| 130|
|[1.497587776E9,13...| 130|
|[1.497587776E9,13...| 130|
|[1.497587776E9,13...| 130|
|[1.497587776E9,13...| 130|
|[1.497587776E9,13...| 130|
|[1.497587776E9,13...| 130|
|[1.497587776E9,13...| 130|
|[1.497587776E9,13...| 130|
|[1.497587776E9,13...| 130|
|[1.497587776E9,13...| 130|
|[1.497587776E9,13...| 130|
|[1.497587776E9,13...| 130|
+--------------------+-----+
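Can you help me?
I don't see the count column being generated in the code in your question. Apart from the count column, @Shankar's answer should get you the result you want.
The following error was caused by an incorrect definition of the udf function, which @Shankar has corrected in his answer.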
Failed to execute user defined function(anonfun$15: (int, int, string, string, int, int, int, int, int) => vector)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
The following error is caused by a version mismatch between the spark-mllib library and the spark-core and spark-sql libraries. They should all be the same version.
error: Caused by: org.apache.spark.SparkException: Failed to execute user defined function(anonfun$15: (int, int, string, string, int, int, int, int, int) => vector) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
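I hope the explanation is clear and that your problem gets resolved soon.
Edited
You still haven't changed the udf function as @Shankar suggested. Also add .trim, because I can see some whitespace in the IP values: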
val toVec9 = udf((a: Int, b: Int, c: String, d: String, e: Int, f: Int, g: Int, h: Int, i: Int) => {
  val e3 = c.trim match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }
  val e4 = d.trim match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }
  Vectors.dense(a, b, e3, e4, e, f, g, h, i)
})
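One further thing that may be worth checking: the match on the IP strings has no default case, so any value other than the three listed addresses (for example one with stray whitespace) raises a scala.MatchError inside the function, and an exception thrown inside a UDF is what Spark reports as "Failed to execute user defined function". The following is a minimal sketch in plain Scala with no Spark dependency. The names IpCodeDemo, ipToCode and strictIpToCode, and the fallback code 0, are assumptions made for illustration and are not part of the original code.

```scala
object IpCodeDemo {
  // Hypothetical helper: maps an IP string to a numeric code.
  // Option(...) guards against null, .trim removes surrounding whitespace,
  // and the wildcard case covers addresses that are not listed.
  def ipToCode(ip: String): Int =
    Option(ip).map(_.trim) match {
      case Some("10.0.0.1") => 1
      case Some("10.0.0.2") => 2
      case Some("10.0.0.3") => 3
      case _                => 0 // assumed fallback for unknown or null input
    }

  // Same shape as the match in the question: no trim, no default case.
  def strictIpToCode(ip: String): Int = ip match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }

  def main(args: Array[String]): Unit = {
    println(ipToCode(" 10.0.0.2 "))  // whitespace is trimmed before matching
    println(ipToCode("192.168.0.1")) // unlisted address takes the fallback
    // The strict version throws scala.MatchError here; Try captures it.
    println(scala.util.Try(strictIpToCode(" 10.0.0.2 ")).isFailure)
  }
}
```

If a fallback value is not acceptable for the model, the same helper could return an Option[Int] instead, so that rows with unknown addresses can be filtered out before building the feature vector.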
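Also, looking at your dependencies, you are using %%, which tells sbt to download the dependencies packaged for the Scala version used on your system. That should be fine, but since you are still getting errors, I would change the dependencies to: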
libraryDependencies ++= Seq(
"co.theasi" %% "plotly" % "0.2.0",
"org.apache.spark" % "spark-core_2.11" % "2.1.1",
"org.apache.spark" % "spark-sql_2.11" % "2.1.1",
"org.apache.spark" %% "spark-hive" % "2.1.1",
"org.apache.spark" % "spark-streaming_2.11" % "2.1.1",
"org.apache.spark" % "spark-mllib_2.11" % "2.1.1"
)
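Since all the Spark modules should share one version, one way to keep them aligned is to declare that version only once. This is a sketch of a build.sbt fragment, not a verified fix for the error. The name sparkVersion is chosen here for illustration, and the scalaVersion line is an assumption based on the _2.11 artifacts listed above.

```scala
// build.sbt (fragment)

// Assumption: a Scala 2.11.x release, so that %% resolves to the _2.11 artifacts.
scalaVersion := "2.11.8"

// Single place to change the Spark version for every module.
val sparkVersion = "2.1.1"

libraryDependencies ++= Seq(
  "co.theasi"        %% "plotly"          % "0.2.0",
  "org.apache.spark" %% "spark-core"      % sparkVersion,
  "org.apache.spark" %% "spark-sql"       % sparkVersion,
  "org.apache.spark" %% "spark-hive"      % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-mllib"     % sparkVersion
)
```

With scalaVersion pinned to 2.11.x, %% and the explicit _2.11 suffix resolve to the same artifacts, so either form can be used as long as it is used consistently.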
I think this is how you should create the udf:
val toVec9 = udf((a: Int, b: Int, c: String, d: String, e: Int, f: Int, g: Int, h: Int, i: Int) => {
  val e3 = c match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }
  val e4 = d match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }
  Vectors.dense(a, b, e3, e4, e, f, g, h, i)
})
and use it as:
val final_df = Dataframe.withColumn(
  "features",
  toVec9(
    // casting into Timestamp to parse the string, and then into Int
    $"time_stamp_0".cast(TimestampType).cast(IntegerType),
    $"count".cast(IntegerType),
    $"sender_ip_1",
    $"receiver_ip_2",
    $"s_port_3".cast(IntegerType),
    $"r_port_4".cast(IntegerType),
    $"acknum_5".cast(IntegerType),
    $"winnum_6".cast(IntegerType),
    $"len_7".cast(IntegerType)
  )
).withColumn("label", (Dataframe("count"))).select("features", "label")
Hope this helps!