
Map two array columns to each other and create columns from one of the arrays in Scala

I am using the pos function from CoreNLP and get a result like the one shown below.

import org.apache.spark.sql.functions._  
import com.databricks.spark.corenlp.functions._  

import sqlContext.implicits._  

val input = Seq(  
  (1, "<xml>Stanford University is located in California. It is a great university.</xml>")  
).toDF("id", "text")  
val output = input
  .select(explode(ssplit('text)).as('sen))
  .select('sen, tokenize('sen).as('words), pos('sen).as('posTags))

The result is as follows:

    +----------------------------------------------+------------------------------------------------------+-------------------------------+  
    |sen                                           |words                                                 |posTags                        |  
    +----------------------------------------------+------------------------------------------------------+-------------------------------+  
    |Stanford University is located in California .|[Stanford, University, is, located, in, California, .]|[NNP, NNP, VBZ, JJ, IN, NNP, .]|
    |It is a great university .                    |[It, is, a, great, university, .]                     |[PRP, VBZ, DT, JJ, NN, .]      |  
    +----------------------------------------------+------------------------------------------------------+-------------------------------+  

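I want to map the words column to the posTags column, so that each POS tag becomes its own column holding the words with that tag: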

+----------------------------------------------+----------------------------------------------------------+-------------------------------+----------------------------------+--------+-------------------------+ 
|    sen                                       |    words                                                 | posTags                       |     NNP                          |  VBZ   |more columns from posTags|  
+----------------------------------------------+----------------------------------------------------------+-------------------------------+----------------------------------+--------+-------------------------+  
|Stanford University is located in California .|[Stanford, University, is, located, in, California, .]    |[NNP, NNP, VBZ, JJ, IN, NNP, .]|[Stanford, University,California] | [is]   |                         |  
|It is a great university .                    |[It, is, a, great, university, .]                         |[PRP, VBZ, DT, JJ, NN, .]      | []                               | [is]   |                         |  
+----------------------------------------------+----------------------------------------------------------+-------------------------------+----------------------------------+--------+-------------------------+ 

Please help me get this.

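I solved the above problem, though in a rather long-winded way. First, I zipped the POS tags with the words using the following UDF: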

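// Pair each POS tag with the word at the same position,
// then group the words by tag (tag -> words with that tag).
val zippPosToWords = udf((tags: Seq[String], words: Seq[String]) => {
  tags.zip(words)
    .groupBy(_._1)
    // map (not mapValues) builds a strict Map instead of a lazy view
    .map { case (tag, pairs) => tag -> pairs.map(_._2) }
})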
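The grouping logic inside the UDF can be checked on plain Scala collections, without Spark. The sketch below is only an illustration: the object and function names (`GroupWordsByTag`, `groupWordsByTag`) are made up here, and the input is the first sentence from the question.

```scala
object GroupWordsByTag {
  // Pair each tag with the word at the same position, then collect the words per tag.
  // Hypothetical helper name; it mirrors the body of the UDF above.
  def groupWordsByTag(tags: Seq[String], words: Seq[String]): Map[String, Seq[String]] =
    tags.zip(words)
      .groupBy { case (tag, _) => tag }
      .map { case (tag, pairs) => tag -> pairs.map { case (_, word) => word } }

  def main(args: Array[String]): Unit = {
    val words = Seq("Stanford", "University", "is", "located", "in", "California", ".")
    val tags  = Seq("NNP", "NNP", "VBZ", "JJ", "IN", "NNP", ".")
    val grouped = groupWordsByTag(tags, words)
    println(grouped("NNP")) // List(Stanford, University, California)
    println(grouped("VBZ")) // List(is)
  }
}
```

Within each tag, the words keep the order in which they appear in the sentence. If the two sequences differ in length, `zip` silently drops the extra elements of the longer one.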

Then I created a column from this data:

val output1=output.withColumn("postowords", zippPosToWords(col("posTags"), col("words")))

Then I exploded it into columns:

val finaldf = output1.select(col("*"),explode('postowords))

This gives you a table structure like the following:

+----------------------------------------------+----------------------------------------------------------+-------------------------------+----------------------------------+--------+-------------------------+ 
|    sen                                       |    words                                                 | posTags                       |     NNP                          |  VBZ   |more columns from posTags|  
+----------------------------------------------+----------------------------------------------------------+-------------------------------+----------------------------------+--------+-------------------------+  
|Stanford University is located in California .|[Stanford, University, is, located, in, California, .]    |[NNP, NNP, VBZ, JJ, IN, NNP, .]|[Stanford, University,California] | [is]   |                         |  
|It is a great university .                    |[It, is, a, great, university, .]                         |[PRP, VBZ, DT, JJ, NN, .]      | []                               | [is]   |                         |  
+----------------------------------------------+----------------------------------------------------------+-------------------------------+----------------------------------+--------+-------------------------+ 
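One caveat: `explode` applied to a map column yields one row per map entry, with `key` and `value` columns, rather than one column per tag. Getting from there to the wide layout shown above probably needs an extra pivot step. The following is a minimal sketch under that assumption; the object name `PivotPosTags`, the function `pivotByTag` and the UDF name `zipPosToWords` are hypothetical, and it assumes a DataFrame with the `sen`, `words` and `posTags` columns from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, first, udf}

object PivotPosTags {
  // Same grouping as the UDF in the answer: tag -> words carrying that tag.
  val zipPosToWords = udf((tags: Seq[String], words: Seq[String]) =>
    tags.zip(words).groupBy(_._1).map { case (tag, pairs) => tag -> pairs.map(_._2) })

  // Assumes df has columns sen, words, posTags; returns one extra column per distinct tag.
  def pivotByTag(df: DataFrame): DataFrame =
    df.select(col("sen"), col("words"), col("posTags"),
        explode(zipPosToWords(col("posTags"), col("words")))) // adds key and value columns
      .groupBy("sen", "words", "posTags")
      .pivot("key")          // each distinct tag becomes a column
      .agg(first("value"))   // the word list for that tag in that sentence

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("pivot-pos").getOrCreate()
    import spark.implicits._
    val df = Seq(
      ("It is", Seq("It", "is"), Seq("PRP", "VBZ"))
    ).toDF("sen", "words", "posTags")
    pivotByTag(df).show(false)
    spark.stop()
  }
}
```

In this sketch, a sentence that has no word for a given tag gets `null` in that column rather than an empty array. Tags such as `.` also become column names, so they need backticks when referenced later.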

Hope it helps someone.

