Scala: convert Seq[String] to String? (TF-IDF after lemmatization)

Question

I am trying to learn Scala, and specifically text mining (lemmatization, TF-IDF matrix and LSA).

I have some texts that I want to lemmatize and then classify (LSA). I use Spark on Cloudera.

So I used the Stanford CoreNLP function:

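    import java.util.Properties
    import scala.collection.JavaConverters._
    import scala.collection.mutable.ArrayBuffer
    import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

    // Lemmatize one document and drop short lemmas and stop words.
    def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      val pipeline = new StanfordCoreNLP(props)
      val doc = new Annotation(text)
      pipeline.annotate(doc)
      val lemmas = new ArrayBuffer[String]()
      // CoreNLP returns Java lists, so convert them before iterating.
      val sentences = doc.get(classOf[SentencesAnnotation]).asScala
      for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation]).asScala) {
        val lemma = token.get(classOf[LemmaAnnotation])
        if (lemma.length > 2 && !stopWords.contains(lemma)) {
          lemmas += lemma.toLowerCase
        }
      }
      lemmas
    }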

After that, I try to build a TF-IDF matrix, but here is my problem: the Stanford function gives me an RDD of Seq[String], and the next step fails with an error. I need an RDD of String (not Seq[String]).

    val (termDocMatrix, termIds, docIds, idfs) = termDocumentMatrix(lemmatized-text, stopWords, numTerms, sc)

Does anyone know how to convert a Seq[String] to a String?

Or do I need to change one of my calls?

Thanks for the help. Sorry if it's a dumb question, and for my English.

Bye

Answer 1

I am not sure what this lemmatization thing is, but as far as making a string out of a sequence goes, you can just do seq.mkString("\n") (or replace "\n" with whatever other separator you want), or just seq.mkString if you want it merged without any separator.

Also, don't use mutable structures; it's bad taste in Scala:
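To make the mkString variants concrete, here is a minimal sketch on a plain Scala Seq. The object name and the sample lemmas are made up for illustration; only mkString itself comes from the standard library.

```scala
object MkStringDemo {
  // Hypothetical sample data standing in for one lemmatized document.
  val lemmas: Seq[String] = Seq("learn", "scala", "text", "mining")

  // Join with a single space between elements.
  val spaced: String = lemmas.mkString(" ")

  // Join with a newline between elements.
  val lines: String = lemmas.mkString("\n")

  // Join with no separator at all.
  val merged: String = lemmas.mkString

  // Three-argument form: prefix, separator, suffix.
  val bracketed: String = lemmas.mkString("[", ", ", "]")

  def main(args: Array[String]): Unit = {
    println(spaced)
    println(merged)
    println(bracketed)
  }
}
```

For feeding a tokenizer afterwards, a space separator is usually the safer choice, since joining without a separator glues the words together.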

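    // Needs import scala.collection.JavaConverters._ because CoreNLP returns Java lists.
    val lemmas = sentences.asScala
      .flatMap(_.get(classOf[TokensAnnotation]).asScala) // one flat list of tokens
      .map(_.get(classOf[LemmaAnnotation]))
      .filter(_.length > 2)
      .filterNot(stopWords)
      .mkString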
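As a rough sketch of how this applies to a whole collection of documents, the example below uses plain Scala collections as a stand-in for the RDD, so it runs without Spark or CoreNLP. The names LemmaJoinDemo, clean, docs and joined are hypothetical, and the input is assumed to be already lemmatized. Unlike the question's code, this sketch lowercases before the stop-word check. On a real RDD[Seq[String]], the same map(_.mkString(" ")) call should give an RDD[String].

```scala
object LemmaJoinDemo {
  // Hypothetical stop-word list for the example.
  val stopWords: Set[String] = Set("the", "and")

  // Immutable clean-up: lowercase, drop short lemmas, drop stop words.
  // A Set[String] is also a function String => Boolean, so it can be
  // passed directly to filterNot.
  def clean(lemmas: Seq[String], stopWords: Set[String]): Seq[String] =
    lemmas
      .map(_.toLowerCase)
      .filter(_.length > 2)
      .filterNot(stopWords)

  // Stand-in for an RDD[Seq[String]]: one Seq of lemmas per document.
  val docs: Seq[Seq[String]] = Seq(
    Seq("The", "cat", "sit", "on", "the", "mat"),
    Seq("Dog", "and", "cat", "play")
  )

  // One String per document, lemmas separated by a space.
  val joined: Seq[String] = docs.map(d => clean(d, stopWords).mkString(" "))

  def main(args: Array[String]): Unit =
    joined.foreach(println)
}
```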

Answer 1 (0) 2017-07-16 13:51:27
