Map two array columns to each other and create columns from one of the arrays in Scala
I am using the POS tagger from CoreNLP and get the result shown below.
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
import sqlContext.implicits._
val input = Seq(
(1, "<xml>Stanford University is located in California. It is a great university.</xml>")
).toDF("id", "text")
val output = input.select(explode(ssplit('text)).as('sen)).select('sen, tokenize('sen).as('words), pos('sen).as('posTags))
The result is as follows:
+----------------------------------------------+------------------------------------------------------+-------------------------------+
|sen                                           |words                                                 |posTags                        |
+----------------------------------------------+------------------------------------------------------+-------------------------------+
|Stanford University is located in California .|[Stanford, University, is, located, in, California, .]|[NNP, NNP, VBZ, JJ, IN, NNP, .]|
|It is a great university .                    |[It, is, a, great, university, .]                     |[PRP, VBZ, DT, JJ, NN, .]      |
+----------------------------------------------+------------------------------------------------------+-------------------------------+
I want to map the words column to the posTags column, so that each POS tag becomes its own column holding the words that carry that tag:
+----------------------------------------------+------------------------------------------------------+-------------------------------+----------------------------------+--------+-------------------------+
|sen                                           |words                                                 |posTags                        |NNP                               |VBZ     |more columns from posTags|
+----------------------------------------------+------------------------------------------------------+-------------------------------+----------------------------------+--------+-------------------------+
|Stanford University is located in California .|[Stanford, University, is, located, in, California, .]|[NNP, NNP, VBZ, JJ, IN, NNP, .]|[Stanford, University, California]|[is]    |                         |
|It is a great university .                    |[It, is, a, great, university, .]                     |[PRP, VBZ, DT, JJ, NN, .]      |[]                                |[is]    |                         |
+----------------------------------------------+------------------------------------------------------+-------------------------------+----------------------------------+--------+-------------------------+
How can I get this result?
I solved the problem above, although in a roundabout way. First, I zipped the POS tags with the words using the following UDF:
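// Pair each POS tag with the word at the same position, then group the words by tag.
// The result is a map from tag to the words carrying that tag.
val zippPosToWords = udf((tags: Seq[String], words: Seq[String]) =>
  tags.zip(words)
    .groupBy { case (tag, _) => tag }
    .map { case (tag, pairs) => tag -> pairs.map { case (_, word) => word } }
)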
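To see what the UDF computes without starting Spark, the same zip-and-group step can be sketched in plain Scala. The object and function names here (`GroupWordsByTag`, `groupWordsByTag`) are made up for illustration and are not part of the original answer:

```scala
object GroupWordsByTag {
  // Pair each tag with the word at the same position, then collect the words per tag.
  def groupWordsByTag(tags: Seq[String], words: Seq[String]): Map[String, Seq[String]] =
    tags.zip(words)
      .groupBy { case (tag, _) => tag }
      .map { case (tag, pairs) => tag -> pairs.map { case (_, word) => word } }

  def main(args: Array[String]): Unit = {
    // First sentence from the question, with the tags shown in its output table.
    val words = Seq("Stanford", "University", "is", "located", "in", "California", ".")
    val tags  = Seq("NNP", "NNP", "VBZ", "JJ", "IN", "NNP", ".")

    val grouped = groupWordsByTag(tags, words)
    println(grouped("NNP").mkString(", ")) // Stanford, University, California
    println(grouped("VBZ").mkString(", ")) // is
  }
}
```

Note that `zip` stops at the shorter of the two sequences, so a length mismatch between tags and words would silently drop the trailing elements.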
Then I created a new column from that data:
val output1=output.withColumn("postowords", zippPosToWords(col("posTags"), col("words")))
Then I exploded it into columns:
val finaldf = output1.select(col("*"),explode('postowords))
This gives one row per POS tag, with the tag in a `key` column and its words in a `value` column, from which the table structure below can be built:
+----------------------------------------------+------------------------------------------------------+-------------------------------+----------------------------------+--------+-------------------------+
|sen                                           |words                                                 |posTags                        |NNP                               |VBZ     |more columns from posTags|
+----------------------------------------------+------------------------------------------------------+-------------------------------+----------------------------------+--------+-------------------------+
|Stanford University is located in California .|[Stanford, University, is, located, in, California, .]|[NNP, NNP, VBZ, JJ, IN, NNP, .]|[Stanford, University, California]|[is]    |                         |
|It is a great university .                    |[It, is, a, great, university, .]                     |[PRP, VBZ, DT, JJ, NN, .]      |[]                                |[is]    |                         |
+----------------------------------------------+------------------------------------------------------+-------------------------------+----------------------------------+--------+-------------------------+
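One caveat: `explode` on a map column emits one row per map entry, in columns named `key` and `value`, so getting one column per tag needs a further step. A possible way to do that is a pivot on `key`, sketched below. This is not from the original answer; it assumes Spark 2.x or later with a `SparkSession`, and `PivotTagsSketch`, `pivotTags` and the hand-built stand-in for `output1` are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, first}

object PivotTagsSketch {
  // Turn the tag -> words map column into one column per tag.
  def pivotTags(df: DataFrame): DataFrame = {
    // explode on a map column yields one row per entry, in columns `key` and `value`.
    val exploded = df.select(col("sen"), explode(col("postowords")))
    // Each (sen, key) pair occurs once, so first() simply picks that single value.
    exploded.groupBy("sen").pivot("key").agg(first("value"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pivot-tags").getOrCreate()
    import spark.implicits._

    // Hand-built stand-in for output1: a sentence plus the map produced by the UDF.
    val output1 = Seq(
      ("Stanford University is located in California .",
        Map("NNP" -> Seq("Stanford", "University", "California"), "VBZ" -> Seq("is"))),
      ("It is a great university .",
        Map("PRP" -> Seq("It"), "VBZ" -> Seq("is")))
    ).toDF("sen", "postowords")

    pivotTags(output1).show(false)
    spark.stop()
  }
}
```

With this approach a tag that does not occur in a sentence comes out as null rather than an empty array, and grouping by `sen` alone merges rows whose sentence text is identical, so a unique id column would be a safer grouping key on real data.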
Hope this helps someone.