Scala: convert RDD[Seq[String]] to RDD[String]? (TF-IDF after lemmatization)
I am trying to learn Scala, specifically text mining (lemmatization, TF-IDF matrix and LSA).
I have some texts that I want to lemmatize and then classify (LSA). I use Spark on Cloudera.
So I used the Stanford CoreNLP function:
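import java.util.Properties
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  // Pipeline that tokenizes, splits sentences, tags parts of speech and lemmatizes
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  // CoreNLP returns Java lists, so convert them before iterating
  val sentences = doc.get(classOf[SentencesAnnotation]).asScala
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation]).asScala) {
    val lemma = token.get(classOf[LemmaAnnotation])
    // Keep lemmas longer than two characters that are not stop words
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas.toSeq
}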
After that, I try to build a TF-IDF matrix, but here is my problem: the Stanford function gives me an RDD[Seq[String]], and I get an error. I need an RDD[String] (not an RDD[Seq[String]]).
val (termDocMatrix, termIds, docIds, idfs) = termDocumentMatrix(lemmatized-text, stopWords, numTerms, sc)
Does anyone know how to convert an RDD[Seq[String]] to an RDD[String]?
Or do I need to change one of my calls?
Thanks for the help. Sorry if it's a dumb question, and for my English.
Bye
I am not sure what this lemmatization thingy is, but as far as making a string out of a sequence goes, you can just do seq.mkString("\n")
(or replace "\n" with whatever other separator you want), or just seq.mkString
if you want it merged without any separator.
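As a minimal sketch of this, assuming each document is already a sequence of lemmas: a plain Seq stands in for the RDD here so the snippet runs without Spark, but since RDD.map also takes an ordinary function, the same `map(_.mkString(" "))` call should apply to an RDD[Seq[String]]. The object name, the value names and the sample data are made up for illustration.

```scala
object MkStringDemo {
  // Hypothetical sample: two documents, each already split into lemmas.
  // A plain Seq stands in for the RDD[Seq[String]] from the question.
  val lemmatized: Seq[Seq[String]] = Seq(
    Seq("spark", "run", "cluster"),
    Seq("text", "mining", "scala")
  )

  // One string per document, lemmas separated by a single space.
  val asStrings: Seq[String] = lemmatized.map(_.mkString(" "))

  // Without an argument, mkString concatenates with no separator at all.
  val merged: String = lemmatized.head.mkString

  def main(args: Array[String]): Unit = {
    asStrings.foreach(println)
    println(merged)
  }
}
```

A space (or any other separator) is probably what you want before TF-IDF, because merging without a separator glues the lemmas into one long token.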
Also, don't use mutable structures; it's considered bad taste in Scala:
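// Assumes `import scala.collection.JavaConverters._` for .asScala
val lemmas = sentences.asScala
  .flatMap(_.get(classOf[TokensAnnotation]).asScala) // flatten tokens across sentences
  .map(_.get(classOf[LemmaAnnotation]))
  .filter(_.length > 2)
  .filterNot(stopWords) // a Set[String] can be used as a predicate
  .map(_.toLowerCase)
  .mkString(" ") // space-separated so the lemmas stay distinct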
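The same immutable style can be tried without CoreNLP. The sketch below uses plain strings in place of annotated tokens; the object name, the sample sentences and the stop-word set are hypothetical. It lowercases before the stop-word check, which is a design choice rather than what the question's code does, so that "The" is matched by the stop word "the".

```scala
object LemmaFilterDemo {
  // Hypothetical input: lemmas grouped by sentence, as a lemmatizer might return them.
  val sentences: Seq[Seq[String]] = Seq(
    Seq("The", "cat", "be", "sleep"),
    Seq("It", "like", "the", "garden")
  )
  val stopWords: Set[String] = Set("the", "like")

  // No mutable buffer: flatten the sentences, then transform and filter.
  val lemmas: Seq[String] = sentences
    .flatten                 // one flat sequence of lemmas
    .map(_.toLowerCase)      // normalize case before comparing with stop words
    .filter(_.length > 2)    // drop very short lemmas
    .filterNot(stopWords)    // a Set[String] is also a String => Boolean

  def main(args: Array[String]): Unit =
    println(lemmas.mkString(" "))
}
```

Keeping `lemmas` as a Seq[String] and calling mkString only at the end leaves both forms available, depending on what the next step expects.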