
Scala: convert Seq[String] to String? (TF-IDF after lemmatization)

I am trying to learn Scala, and specifically text mining (lemmatization, TF-IDF matrix and LSA).

I have some texts that I want to lemmatize and then classify with LSA. I use Spark on Cloudera.

So I used this Stanford CoreNLP function:

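    import java.util.Properties
    import scala.collection.JavaConverters._
    import scala.collection.mutable.ArrayBuffer
    import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

    def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
      // Build a CoreNLP pipeline that tokenizes, splits sentences, tags POS and lemmatizes.
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      val pipeline = new StanfordCoreNLP(props)
      val doc = new Annotation(text)
      pipeline.annotate(doc)
      val lemmas = new ArrayBuffer[String]()
      // CoreNLP returns java.util.List values, so convert them before iterating.
      val sentences = doc.get(classOf[SentencesAnnotation]).asScala
      for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation]).asScala) {
        val lemma = token.get(classOf[LemmaAnnotation])
        // Keep lemmas longer than two characters that are not stop words.
        if (lemma.length > 2 && !stopWords.contains(lemma)) {
          lemmas += lemma.toLowerCase
        }
      }
      lemmas
    }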

After that, I try to build a TF-IDF matrix, and here is my problem: mapping the Stanford function over my texts produces an RDD[Seq[String]], and the next call fails with an error. It needs an RDD[String], not an RDD[Seq[String]].

    val (termDocMatrix, termIds, docIds, idfs) = termDocumentMatrix(lemmatizedText, stopWords, numTerms, sc)

Does anyone know how to convert a Seq[String] to a String?

Or do I need to change one of my calls?

Thanks for the help. Sorry if it's a dumb question, and sorry for my English.

Bye

I am not sure what this lemmatization thing is, but as far as making a string out of a sequence goes, you can just call seq.mkString("\n") (replace "\n" with whatever separator you want), or plain seq.mkString if you want the elements merged without any separator.
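As a minimal sketch of that suggestion, assuming each document is already a Seq[String] of lemmas: the sample data and the names `MkStringDemo`, `docs`, `joined` and `glued` are made up for illustration, and plain Scala collections stand in for the RDD.

```scala
object MkStringDemo {
  // Hypothetical sample data: one Seq[String] of lemmas per document.
  val docs: Seq[Seq[String]] = Seq(
    Seq("spark", "run", "cluster"),
    Seq("text", "mining")
  )

  // Join each document's lemmas with a single space as separator.
  val joined: Seq[String] = docs.map(_.mkString(" "))

  // Without an argument, mkString concatenates with no separator at all.
  val glued: String = docs.head.mkString

  def main(args: Array[String]): Unit = joined.foreach(println)
}
```

On an RDD[Seq[String]] the same `map(_.mkString(" "))` call should give an RDD[String]. A space is probably a safer separator than none if the text is going to be tokenized again later.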

Also, avoid mutable structures where you can; they are considered bad style in Scala:

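    // Assumes import scala.collection.JavaConverters._ for .asScala
    val lemmas = sentences.asScala
      .flatMap(_.get(classOf[TokensAnnotation]).asScala) // all tokens of all sentences
      .map(_.get(classOf[LemmaAnnotation]))
      .filter(_.length > 2)
      .filterNot(stopWords)                              // a Set is also a predicate
      .map(_.toLowerCase)
      .mkString(" ")                                     // space-separated, so words stay distinct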
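The filter chain can also be tried without CoreNLP on a plain Seq[String] of already-extracted lemmas. This is only a sketch: `LemmaFilterDemo` and `cleanLemmas` are hypothetical names, and the stop-word check runs before lowercasing, as in the question's code.

```scala
object LemmaFilterDemo {
  // Hypothetical helper: keeps lemmas longer than two characters that are
  // not stop words, then lowercases what is left.
  def cleanLemmas(lemmas: Seq[String], stopWords: Set[String]): Seq[String] =
    lemmas
      .filter(_.length > 2)
      .filterNot(stopWords) // Set[String] can be used as String => Boolean
      .map(_.toLowerCase)

  def main(args: Array[String]): Unit = {
    val stopWords = Set("the", "and")
    val lemmas = Seq("the", "Cat", "be", "and", "sleep")
    // prints "cat sleep"
    println(cleanLemmas(lemmas, stopWords).mkString(" "))
  }
}
```

Returning a Seq[String] from the helper and calling mkString only at the end keeps the choice between a token sequence and a single string with the caller.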
