Spark Scala DataFrame: add a new column to a DataFrame based on processing rows of the same column

Question

Dear all, I'm new to Spark with Scala. I have a DataFrame with two columns, "UG" and "Counts", and I would like to obtain the third column shown in this list.

DF columns: UG, Counts, CUG

UG    Counts  CUG
of    12      4
of    23      4
the   134     3
love  68      2
pain  3       1
the   18      3
love  100     2
of    23      4
the   12      3
of    11      4

I need to add a new column called "CUG" (the third one shown above), where CUG(i) is the number of times that the string UG(i) appears in the whole "UG" column.

I tried the following approach:

With the first two columns of the table above in df, I wrote a SQL UDF to count the number of times that the string appears in the column "UG":

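import org.apache.spark.sql.functions.{col, udf}

// Attempted UDF: count the rows of df whose UG value matches w1.
// This is the version that throws the NullPointerException described below.
val NW1 = (w1: String) => {
  df.filter($"UG".like(w1.substring(1, w1.length - 1))).count()
}: Long
val sqlfunc = udf(NW1)
val df2 = df.withColumn("CUG", sqlfunc(col("UG")))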

But when I tried it, it didn't work: I got a NullPointerException. The function worked on its own, but not inside the DataFrame expression. What can I do to obtain the desired result using DataFrames?

Thanks in advance. jm3

Answer 1

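The UDF approach fails because a DataFrame such as df cannot be used inside a UDF: the UDF body runs on the executors, where df is not usable, hence the NullPointerException. What you can do instead is first count the number of rows grouped by the UG column, which gives the third column you need, and then join that result with the original data frame. You can rename the count column with the withColumnRenamed function.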

scala> import org.apache.spark.sql.functions._

scala> myDf.show()
+----+------+
|  UG|Counts|
+----+------+
|  of|    12|
|  of|    23|
| the|   134|
|love|    68|
|pain|     3|
| the|    18|
|love|   100|
|  of|    23|
| the|    12|
|  of|    11|
+----+------+     


scala> myDf.join(myDf.groupBy("UG").count().withColumnRenamed("count", "CUG"), "UG").show()
+----+------+---+
|  UG|Counts|CUG|
+----+------+---+
|  of|    12|  4|
|  of|    23|  4|
| the|   134|  3|
|love|    68|  2|
|pain|     3|  1|
| the|    18|  3|
|love|   100|  2|
|  of|    23|  4|
| the|    12|  3|
|  of|    11|  4|
+----+------+---+
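
As an alternative to the groupBy-and-join above, the same per-group count can usually be attached with a window function, which avoids the explicit join. The following is only a minimal sketch, not part of the original answer. It assumes Spark 2.x or later (SparkSession), a local master for illustration, and the hypothetical names CugWindowSketch and addCug. The order of the rows in the result may differ from the input order.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}

object CugWindowSketch {
  // Hypothetical helper: adds a CUG column holding the number of rows
  // that share the same UG value as the current row.
  def addCug(df: DataFrame): DataFrame =
    df.withColumn("CUG", count(lit(1)).over(Window.partitionBy("UG")))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .appName("cug-window-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same rows as the table in the question.
    val myDf = Seq(
      ("of", 12), ("of", 23), ("the", 134), ("love", 68), ("pain", 3),
      ("the", 18), ("love", 100), ("of", 23), ("the", 12), ("of", 11)
    ).toDF("UG", "Counts")

    addCug(myDf).show()
    spark.stop()
  }
}
```

Both approaches shuffle the data by UG. The window version keeps everything in a single expression, while the join version makes the intermediate counts table available if it is needed elsewhere.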

solution1
0 2016-05-09 00:15:31