How to do data pre-processing in Spark in this case
I made the following dataset with Scala.
+--------------------+---+
| text| docu_no|
+--------------------+---+
|서울,NNP 시내,NNG 한,M...| 1|
|최저,NNG 임금,NNG 때문,...| 2|
|왜,MAG 시급,NNG 만,JX...| 3|
|지금,MAG 경제,NNG 가,J...| 4|
|임대료,NNG 폭리,NNG 내리...| 5|
|모든,MM 문제,NNG 를,JK...| 6|
|니,NP 들,XSN 이,JKS ...| 7|
|실제,NNG 자영업,NNG 자,...| 8|
I want to make a DTM (document-term matrix) for analysis. For example:
docu_no|서울|시내|한|최저|임금|지금|폭리|...
      1|   1|   1|  1|   0|   0|   0|   0|
      2|   0|   0|  0|   1|   1|   1|   1|
For this, I thought of pre-processing it as follows:
+---------+-----+-------+
|     text|count|docu_no|
+---------+-----+-------+
| 서울,NNP|    1|      1|
| 시내,NNG|    1|      1|
|    한,M.|    1|      1|
| 최저,NNG|    1|      2|
| 임금,NNG|    1|      2|
| 때문,...|    1|      2|
+---------+-----+-------+
After I make this (RDD or Dataset), if I use group by and pivot, I will get the results that I want. But it is too difficult for me. If you have ideas, please let me know.
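For the flattening step, here is a minimal sketch in spark-shell. It assumes the dataset shown above is already loaded in a DataFrame with columns text and docu_no; the name docs is just for illustration, and the token,POS pairs are assumed to be separated by whitespace.

import org.apache.spark.sql.functions._

// Split the whitespace-separated "token,POS" pairs and explode them,
// producing one row per token per document, plus a constant count of 1.
// `docs` is a placeholder name for the DataFrame shown above.
val tokens = docs
  .withColumn("text", explode(split($"text", "\\s+")))
  .withColumn("count", lit(1))
  .select($"text", $"count", $"docu_no")

tokens.show()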
// Toy example in spark-shell: build a small DataFrame to demonstrate pivot
val data = List(("A", 1), ("B", 2), ("C", 3), ("E", 4), ("F", 5))
val df = sc.parallelize(data).toDF("text", "doc_no")
df.show()
+----+------+
|text|doc_no|
+----+------+
| A| 1|
| B| 2|
| C| 3|
| E| 4|
| F| 5|
+----+------+
import org.apache.spark.sql.functions._

// One row per doc_no, one column per distinct text value, counting occurrences
df.groupBy($"doc_no").pivot("text").agg(count("doc_no")).show()
+------+---+---+---+---+---+
|doc_no| A| B| C| E| F|
+------+---+---+---+---+---+
| 1| 1| 0| 0| 0| 0|
| 2| 0| 1| 0| 0| 0|
| 3| 0| 0| 1| 0| 0|
| 4| 0| 0| 0| 1| 0|
| 5| 0| 0| 0| 0| 1|
+------+---+---+---+---+---+
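The same groupBy + pivot can then be applied to the exploded tokens DataFrame from the sketch above to get the desired DTM. Again a sketch under the same assumptions (spark-shell, columns text, count, docu_no):

// Pivot: one row per docu_no, one column per distinct "token,POS" pair.
// Missing document/token combinations come out as null, so fill them with 0.
val dtm = tokens
  .groupBy($"docu_no")
  .pivot("text")
  .agg(sum($"count"))
  .na.fill(0)
  .orderBy($"docu_no")

dtm.show()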