How to pre-process data in Spark in this case

Question

I made the following dataset with Scala.

+--------------------+-------+
|                text|docu_no|
+--------------------+-------+
|서울,NNP 시내,NNG 한,M...|      1|
|최저,NNG 임금,NNG 때문,...|      2|
|왜,MAG 시급,NNG 만,JX...|      3|
|지금,MAG 경제,NNG 가,J...|      4|
|임대료,NNG 폭리,NNG 내리...|      5|
|모든,MM 문제,NNG 를,JK...|      6|
|니,NP 들,XSN 이,JKS ...|      7|
|실제,NNG 자영업,NNG 자,...|      8|
+--------------------+-------+

I want to build a document-term matrix (DTM) for analysis. For example:

docu_no|서울|시내|한|최저|임금|지금|폭리 ...
1      |1   |1   |1 |0   |0   |0   |0
2      |0   |0   |0 |1   |1   |1   |1

To get there, I think the data first needs to be pre-processed into the following form.

+--------+-----+-------+
|    text|count|docu_no|
+--------+-----+-------+
|서울,NNP|    1|      1|
|시내,NNG|    1|      1|
|한,M.   |    1|      1|
|최저,NNG|    1|      2|
|임금,NNG|    1|      2|
|때문,...|    1|      2|
+--------+-----+-------+

Once I have this (as an RDD or Dataset), I should be able to get the result I want with groupBy and pivot. But this is too difficult for me. If you have any ideas, please let me know.

Answer 1

val data = List(("A", 1),("B", 2),("C", 3),("E", 4),("F", 5))

val df = sc.parallelize(data).toDF("text","doc_no")
df.show()

+----+------+
|text|doc_no|
+----+------+
|   A|     1|
|   B|     2|
|   C|     3|
|   E|     4|
|   F|     5|
+----+------+

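import org.apache.spark.sql.functions._
// Pivot the distinct text values into columns and count rows per doc_no.
// na.fill(0) replaces any combination the pivot leaves as null with 0.
df.groupBy($"doc_no").pivot("text").agg(count("doc_no")).na.fill(0).orderBy("doc_no").show()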
+------+---+---+---+---+---+
|doc_no|  A|  B|  C|  E|  F|
+------+---+---+---+---+---+
|     1|  1|  0|  0|  0|  0|
|     2|  0|  1|  0|  0|  0|
|     3|  0|  0|  1|  0|  0|
|     4|  0|  0|  0|  1|  0|
|     5|  0|  0|  0|  0|  1|
+------+---+---+---+---+---+
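
The example above starts from one term per row. To get there from the tagged text column in the question, one possible approach is to split each text on spaces, explode the tokens into rows, and then pivot. The following is only a sketch under a few assumptions: the tokens are separated by single spaces, each token looks like "word,TAG", and only the word part should become a column. The names DtmSketch and buildDtm and the two sample rows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, split}

object DtmSketch {
  // Hypothetical helper: expects columns "text" ("word,TAG word,TAG ...") and "docu_no".
  def buildDtm(raw: DataFrame): DataFrame = {
    // One row per token: split the text on spaces, then explode the array.
    val tokens = raw
      .withColumn("token", explode(split(col("text"), " ")))
      // Keep only the part before the comma, dropping the POS tag.
      .withColumn("word", split(col("token"), ",").getItem(0))
      // Drop empty strings produced by repeated spaces.
      .filter(col("word") =!= "")
      .select("docu_no", "word")

    // Count each (docu_no, word) pair and spread the words into columns.
    tokens
      .groupBy("docu_no")
      .pivot("word")
      .count()
      .na.fill(0) // words absent from a document become 0
      .orderBy("docu_no")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dtm-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows in the same layout as the question's data.
    val raw = Seq(
      ("서울,NNP 시내,NNG", 1),
      ("최저,NNG 임금,NNG", 2)
    ).toDF("text", "docu_no")

    buildDtm(raw).show()
    spark.stop()
  }
}
```

The DataFrame called tokens inside buildDtm is the long (docu_no, word) form sketched in the question, and the count happens during the pivot, so no separate count column is needed. Note that pivot creates one column per distinct word and Spark caps the number of pivot values through spark.sql.pivotMaxValues (10000 by default), so this wide layout is only practical for a modest vocabulary.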
