
Failed to execute user defined function in Apache Spark using Scala

I have the following dataframe:

+---------------+-----------+-------------+--------+--------+--------+--------+------+-----+
|   time_stamp_0|sender_ip_1|receiver_ip_2|s_port_3|r_port_4|acknum_5|winnum_6| len_7|count|
+---------------+-----------+-------------+--------+--------+--------+--------+------+-----+
|06:36:16.293711|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58| 65161|  130|
|06:36:16.293729|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58| 65913|  130|
|06:36:16.293743|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|131073|  130|
|06:36:16.293765|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|196233|  130|
|06:36:16.293783|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|196985|  130|
|06:36:16.293798|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|262145|  130|
|06:36:16.293820|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|327305|  130|
|06:36:16.293837|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|328057|  130|
|06:36:16.293851|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|393217|  130|
|06:36:16.293873|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|458377|  130|
|06:36:16.293890|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|459129|  130|
|06:36:16.293904|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|524289|  130|
|06:36:16.293926|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|589449|  130|
|06:36:16.293942|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|590201|  130|
|06:36:16.293956|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|655361|  130|
|06:36:16.293977|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|720521|  130|
|06:36:16.293994|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|721273|  130|
|06:36:16.294007|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|786433|  130|
|06:36:16.294028|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|851593|  130|
|06:36:16.294045|   10.0.0.1|     10.0.0.2|   55518|    5001|       0|      58|852345|  130|
+---------------+-----------+-------------+--------+--------+--------+--------+------+-----+
only showing top 20 rows

I have to add a features column and a label to my dataframe in order to predict the count value. But when I run the code, I see the following error:

Failed to execute user defined function(anonfun$15: (int, int, string, string, int, int, int, int, int) => vector)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)

I have also cast all of my features with cast(IntegerType), but the error occurs again. Here is my code:

val Frist_Dataframe = sqlContext.createDataFrame(Row_Dstream_Train, customSchema)

val toVec9 = udf[Vector, Int, Int, String, String, Int, Int, Int, Int, Int] { (a, b, c, d, e, f, g, h, i) =>
  val e3 = c match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }

  val e4 = d match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }
  Vectors.dense(a, b, e3, e4, e, f, g, h, i)
}

val final_df = Dataframe.withColumn(
  "features",
  toVec9(
    // casting into Timestamp to parse the string, and then into Int
    $"time_stamp_0".cast(TimestampType).cast(IntegerType),
    $"count".cast(IntegerType),
    $"sender_ip_1",
    $"receiver_ip_2",
    $"s_port_3".cast(IntegerType),
    $"r_port_4".cast(IntegerType),
    $"acknum_5".cast(IntegerType),
    $"winnum_6".cast(IntegerType),
    $"len_7".cast(IntegerType)
  )
).withColumn("label", Dataframe("count")).select("features", "label")

final_df.show()

val trainingTest = final_df.randomSplit(Array(0.8, 0.2))
val TrainingDF = trainingTest(0).toDF()
val TestingDF=trainingTest(1).toDF()
TrainingDF.show()
TestingDF.show()

My dependencies are:

libraryDependencies ++= Seq(
  "co.theasi" %% "plotly" % "0.2.0",
  "org.apache.spark" %% "spark-core" % "2.1.1",
  "org.apache.spark" %% "spark-sql" % "2.1.1",
  "org.apache.spark" %% "spark-hive" % "2.1.1",
  "org.apache.spark" %% "spark-streaming" % "2.1.1",
  "org.apache.spark" %% "spark-mllib" % "2.1.1"
)

The most interesting point is that if I change all of the cast(IntegerType) casts in the last part of the code to cast(TimestampType).cast(IntegerType), the error disappears and the output looks like this:

+--------+-----+
|features|label|
+--------+-----+
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
|    null|  130|
+--------+-----+
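I suspect this happens because only time_stamp_0 holds an actual timestamp: casting any of the other strings, for example a port number, to TimestampType yields null, which lines up with the all-null features column above. A minimal sketch of the effect, assuming a SparkSession named spark:

import org.apache.spark.sql.types.{IntegerType, TimestampType}
import spark.implicits._

// "55518" is not a valid timestamp, so the double cast produces null:
Seq("55518").toDF("s_port_3")
  .select($"s_port_3".cast(TimestampType).cast(IntegerType).as("port_as_int"))
  .show()
// +-----------+
// |port_as_int|
// +-----------+
// |       null|
// +-----------+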

UPDATE: After applying @Ramesh Maharjan's solution, my dataframe comes out fine. However, whenever I try to split the final_df dataframe into training and test sets, as shown below, I still have the same problem with null rows:

+--------------------+-----+
|            features|label|
+--------------------+-----+
|                null|  130|
|                null|  130|
|                null|  130|
|                null|  130|
|                null|  130|
|                null|  130|
|                null|  130|
|                null|  130|
|[1.497587776E9,13...|  130|
|[1.497587776E9,13...|  130|
|[1.497587776E9,13...|  130|
|[1.497587776E9,13...|  130|
|[1.497587776E9,13...|  130|
|[1.497587776E9,13...|  130|
|[1.497587776E9,13...|  130|
|[1.497587776E9,13...|  130|
|[1.497587776E9,13...|  130|
|[1.497587776E9,13...|  130|
|[1.497587776E9,13...|  130|
|[1.497587776E9,13...|  130|
+--------------------+-----+
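For now I can drop the null rows before splitting (a workaround sketch, not a root-cause fix):

// Drop rows whose features vector is null, then split.
val cleaned_df = final_df.na.drop(Seq("features"))
val Array(trainingDF, testingDF) = cleaned_df.randomSplit(Array(0.8, 0.2))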

Can you help me?

I don't see the count column being generated anywhere in the code in your question. Apart from the count column, @Shankar's answer should get you the result you want.

The following error was due to the wrong definition of the udf function, which @Shankar had corrected in his answer:

Failed to execute user defined function(anonfun$15: (int, int, string, string, int, int, int, int, int) => vector)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)

The following error is due to a version mismatch between the spark-mllib, spark-core, and spark-sql libraries. They should all be the same version:

error: Caused by: org.apache.spark.SparkException: Failed to execute user defined function(anonfun$15: (int, int, string, string, int, int, int, int, int) => vector) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
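If you keep the %% form, one way to make sure all Spark modules stay on the same version is to factor the version into a single val in build.sbt; a sketch:

val sparkVersion = "2.1.1"

libraryDependencies ++= Seq(
  "co.theasi" %% "plotly" % "0.2.0",
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)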

I hope the explanation is clear, and I hope to see your issue resolved soon.

Edit

You haven't changed the udf function as @Shankar did either. Add .trim, as I can see some spaces in the ip values:

val toVec9 = udf ((a: Int, b: Int, c: String, d: String, e: Int, f: Int, g: Int, h: Int, i: Int) =>
  {
  val e3 = c.trim match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }
  val e4 = d.trim match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }
  Vectors.dense(a, b, e3, e4, e, f, g, h, i)
})
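Note also that the match is still not exhaustive: an ip value other than the three listed ones would throw scala.MatchError inside the udf and fail the whole stage. A defensive variant could add a fallback case; the 0 fallback below is an assumption, pick whatever encoding suits your data:

val toVec9Safe = udf((a: Int, b: Int, c: String, d: String, e: Int, f: Int, g: Int, h: Int, i: Int) => {
  // Map an ip string to a numeric code; unknown values fall back to 0
  // instead of throwing scala.MatchError.
  def ipCode(ip: String): Int = ip.trim match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
    case _          => 0 // assumed fallback for unexpected ips
  }
  Vectors.dense(a, b, ipCode(c), ipCode(d), e, f, g, h, i)
})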

And looking at your dependencies: you are using %%, which tells sbt to download the dependencies packaged for the Scala version installed on your system. That should be fine, but since you are still getting errors, I would change the dependencies to:

libraryDependencies ++= Seq(
  "co.theasi" %% "plotly" % "0.2.0",
  "org.apache.spark" % "spark-core_2.11" % "2.1.1",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.1",
  "org.apache.spark" %% "spark-hive" % "2.1.1",
  "org.apache.spark" % "spark-streaming_2.11" % "2.1.1",
  "org.apache.spark" % "spark-mllib_2.11" % "2.1.1"

)

I think it is the way you are creating the udf; you should create it as below:

val toVec9 = udf ((a: Int, b: Int, c: String, d: String, e: Int, f: Int, g: Int, h: Int, i: Int) =>
{
  val e3 = c match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }

  val e4 = d match {
    case "10.0.0.1" => 1
    case "10.0.0.2" => 2
    case "10.0.0.3" => 3
  }
  Vectors.dense(a, b, e3, e4, e, f, g, h, i)

})

And use it as:

val final_df = Dataframe.withColumn(
              "features",
              toVec9(
                // casting into Timestamp to parse the string, and then into Int
                $"time_stamp_0".cast(TimestampType).cast(IntegerType),
                $"count".cast(IntegerType),
                $"sender_ip_1",
                $"receiver_ip_2",
                $"s_port_3".cast(IntegerType),
                $"r_port_4".cast(IntegerType),
                $"acknum_5".cast(IntegerType),
                $"winnum_6".cast(IntegerType),
                $"len_7".cast(IntegerType)
              )
            ).withColumn("label", (Dataframe("count"))).select("features", "label")
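As an aside, since every input ends up numeric anyway, Spark's VectorAssembler can build the features column without a hand-written udf. A sketch, assuming the ip columns are first mapped to numeric codes (the sender_code and receiver_code names here are hypothetical) and the port/acknum/winnum/len columns are already numeric in the schema:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.when

val numericDf = Dataframe
  .withColumn("sender_code",
    when($"sender_ip_1" === "10.0.0.1", 1).when($"sender_ip_1" === "10.0.0.2", 2).otherwise(3))
  .withColumn("receiver_code",
    when($"receiver_ip_2" === "10.0.0.1", 1).when($"receiver_ip_2" === "10.0.0.2", 2).otherwise(3))
  .withColumn("time_int", $"time_stamp_0".cast(TimestampType).cast(IntegerType))

val assembler = new VectorAssembler()
  .setInputCols(Array("time_int", "count", "sender_code", "receiver_code",
    "s_port_3", "r_port_4", "acknum_5", "winnum_6", "len_7"))
  .setOutputCol("features")

val assembled = assembler.transform(numericDf).select($"features", $"count".as("label"))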

Hope this helps!
