
Spark Scala DF: add a new column to a DF based on processing of some rows of the same column

Dears, I'm new to Spark Scala. I have a DF with two columns, "UG" and "Counts", and I would like to obtain the third column as shown in this list.

DF: UG, Counts, CUG (the columns)

  • of 12 4
  • of 23 4
  • the 134 3
  • love 68 2
  • pain 3 1
  • the 18 3
  • love 100 2
  • of 23 4
  • the 12 3
  • of 11 4

I need to add a new column called "CUG" (the third one shown above), where CUG(i) is the number of times that string(i) in UG appears in the whole column.

I tried the following scheme:

With the DF as in the previous table in df, I wrote a SQL UDF function to count the number of times a string appears in the column "UG", that is:

val NW1 = (w1: String) => {
  df.filter($"UG".like(w1.substring(1, w1.length - 1))).count()
}: Long
val sqlfunc = udf(NW1)
val df2= df.withColumn("CUG",sqlfunc(col("UG")))

But when I tried it, it didn't work: I got a NullPointerException. The UDF worked in isolation but not within the DF. What can I do to obtain the desired results using a DF?

Thanks in advance. jm3

So what you can do is first count the number of rows grouped by the UG column, which gives the third column you need, and then join with the original data frame. (The original UDF fails with a NullPointerException because a UDF runs on the executors, where the driver-side df reference is not available.) You can rename the column with the withColumnRenamed function if you want.

scala> import org.apache.spark.sql.functions._

scala> myDf.show()
+----+------+
|  UG|Counts|
+----+------+
|  of|    12|
|  of|    23|
| the|   134|
|love|    68|
|pain|     3|
| the|    18|
|love|   100|
|  of|    23|
| the|    12|
|  of|    11|
+----+------+     


scala> myDf.join(myDf.groupBy("UG").count().withColumnRenamed("count", "CUG"), "UG").show()
+----+------+---+
|  UG|Counts|CUG|
+----+------+---+
|  of|    12|  4|
|  of|    23|  4|
| the|   134|  3|
|love|    68|  2|
|pain|     3|  1|
| the|    18|  3|
|love|   100|  2|
|  of|    23|  4|
| the|    12|  3|
|  of|    11|  4|
+----+------+---+
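As a side note (not from the original answer), the same result can also be obtained with a window function, which avoids the explicit self-join. This is a minimal sketch assuming the same `myDf` as above and an active `SparkSession` with `spark.implicits._` imported:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// Partition the rows by the value of UG; count("*") over that window
// yields, for each row, the number of rows sharing the same UG value.
val w = Window.partitionBy("UG")

val withCug = myDf.withColumn("CUG", count("*").over(w))
withCug.show()
```

A window avoids shuffling the data twice for a groupBy plus a join, though for a low-cardinality key the grouped-join version from the answer performs comparably.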
