Scala: converting [Seq[String]] to [String]? (TF-IDF after lemmatization)

Question

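I am trying to learn Scala, specifically text mining (lemmatization, TF-IDF matrix and LSA).

I have some texts that I want to lemmatize and then classify (LSA). I am using Spark on Cloudera.

So I used this Stanford CoreNLP function: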

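    import java.util.Properties
    import scala.collection.JavaConverters._
    import scala.collection.mutable.ArrayBuffer
    import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

    def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
      // Build a CoreNLP pipeline: tokenize, split sentences, tag POS, lemmatize.
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      val pipeline = new StanfordCoreNLP(props)
      val doc = new Annotation(text)
      pipeline.annotate(doc)
      val lemmas = new ArrayBuffer[String]()
      // CoreNLP returns Java lists, so convert them before iterating.
      val sentences = doc.get(classOf[SentencesAnnotation]).asScala
      for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation]).asScala) {
        val lemma = token.get(classOf[LemmaAnnotation])
        // Keep lemmas longer than two characters that are not stop words.
        if (lemma.length > 2 && !stopWords.contains(lemma)) {
          lemmas += lemma.toLowerCase
        }
      }
      lemmas
    }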

After that, I try to build a TF-IDF matrix, and this is where my problem is: the Stanford function gives me an RDD in [Seq[String]] form, and I get an error. I need an RDD in [String] form (not [Seq[String]]).

    val (termDocMatrix, termIds, docIds, idfs) = termDocumentMatrix(lemmatized-text, stopWords, numTerms, sc)

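Does anyone know how to convert [Seq[String]] to [String]?

Or do I need to change one of my calls?

Thanks for your help. Sorry if this is a silly question, and sorry for my English.

Bye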

Answer 1

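I am not sure what this lemmatization thing is, but as far as making a string out of a sequence goes, you can do seq.mkString("\n") (or replace "\n" with whatever other separator you want), or just seq.mkString if you want to merge the elements without any separator.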
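To make the difference between the variants concrete, here is a minimal sketch on a plain Seq[String]. The object name MkStringDemo and the sample lemmas are made up for illustration; only mkString itself comes from the standard library.

```scala
object MkStringDemo {
  // Hypothetical sample data standing in for one lemmatized document.
  val lemmas: Seq[String] = Seq("spark", "text", "mining")

  // Join with a space so the tokens can still be split apart later.
  val spaced: String = lemmas.mkString(" ")

  // Join with a newline, one lemma per line.
  val lines: String = lemmas.mkString("\n")

  // Join with no separator at all: the word boundaries are lost.
  val merged: String = lemmas.mkString
}
```

For TF-IDF you most likely want the space-separated form, since the version without a separator glues all lemmas into one long token.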

Also, don't use mutable structures; it is considered bad taste in Scala:

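    // Assumes import scala.collection.JavaConverters._ for .asScala
    val lemmas = sentences.asScala
      .flatMap(_.get(classOf[TokensAnnotation]).asScala) // flatten sentences into tokens
      .map(_.get(classOf[LemmaAnnotation]))
      .filter(_.length > 2)
      .filterNot(stopWords)                              // a Set[String] is also a String => Boolean
      .map(_.toLowerCase)
      .mkString(" ")                                     // keep a separator so lemmas stay distinct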
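If every document is already a Seq[String] of lemmas, the same idea can be applied per document. The sketch below uses a plain Seq[Seq[String]] as a stand-in for the RDD so that it runs without Spark; joinLemmas, the stop-word set and the sample documents are hypothetical names and data, not part of the question's code.

```scala
object LemmaJoinDemo {
  // Hypothetical helper mirroring the filters from the question:
  // drop short lemmas and stop words, lowercase, then join with spaces.
  def joinLemmas(lemmas: Seq[String], stopWords: Set[String]): String =
    lemmas
      .filter(_.length > 2)
      .filterNot(stopWords) // Set[String] doubles as a String => Boolean predicate
      .map(_.toLowerCase)
      .mkString(" ")

  // Made-up sample data: two documents, each a sequence of lemmas.
  val stopWords: Set[String] = Set("the", "and")
  val docs: Seq[Seq[String]] = Seq(
    Seq("the", "cat", "sit", "on", "mat"),
    Seq("Spark", "and", "Scala")
  )

  // One String per document instead of one Seq[String] per document.
  val joined: Seq[String] = docs.map(joinLemmas(_, stopWords))
}
```

Assuming your lemmatized data really is an RDD[Seq[String]], calling map(_.mkString(" ")) on it should have the same shape and give you an RDD[String].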


Solution 1
0 2017-07-16 13:51:27
