How to calculate term frequency for a document in Spark?

I'm working on a document classification algorithm in Spark. I want to create a dictionary from the terms in each to-be-classified document. Here is what I have so far:

import java.io.StringReader
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import scala.collection.mutable

def tokenize(content: String): Seq[String] = {
  val tReader = new StringReader(content)
  val analyzer = new EnglishAnalyzer(LuceneVersion) // LuceneVersion is assumed in scope
  val tStream = analyzer.tokenStream("contents", tReader)
  val term = tStream.addAttribute(classOf[CharTermAttribute])
  tStream.reset()

  val result = mutable.ArrayBuffer.empty[String]
  while (tStream.incrementToken()) {
    result += term.toString
  }
  tStream.end()
  tStream.close()
  result
}

This function takes a string, tokenizes and stems it, and returns a Seq[String]. Here is how I call it:

val testInstance = sc.textFile("to-be-classified.txt")
testInstance.flatMap(line1 => tokenize(line1)).map(line2 => (line2,1))    

This is as far as I've gone. Can someone help me create a Dictionary-type structure that has each term as the key and its frequency as the value?

EDIT: I thought of a better approach, but I can't quite write it. Here is part of it:

 case class doc_terms(filename: String, terms: List[(String, Int)])

Then my idea is to create an object of class doc_terms for each document I read. It contains a list of all the terms in that document. Then do a reduce by key, in which I find the frequency of each term for each document. At the end I'll have an RDD in which each entity looks like (file1,[('term1',12),('term2',23)...]). Can someone help me write this?
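What I have in mind is roughly this; a rough sketch, assuming all to-be-classified documents sit in one directory ("/data/docs" is a placeholder path; wholeTextFiles yields (filename, contents) pairs) and that the tokenize function above is in scope:

val perDoc = sc.wholeTextFiles("/data/docs")          // RDD[(filename, contents)]
  .map { case (file, content) =>
    val counts = tokenize(content)
      .groupBy(identity)                              // term -> all of its occurrences
      .map { case (term, occs) => (term, occs.size) } // term -> frequency
      .toList
    doc_terms(file, counts)
  }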

OK so I found two ways to do this.

I am going to use a simplified tokenizer; you can replace my tokenizer with something more complex and everything should still run.

For the text data I am using a text file of the novel War and Peace.

Note that I've changed the exact classes a bit to keep the types compatible. The term-count function is called study; it takes a single parameter (the input filename) and returns a DocTerms.

Method 1

import scala.collection.immutable.WrappedString;
import scala.collection.Map

def tokenize(line:String):Array[String] =
    new WrappedString(line).toLowerCase().split(' ')

case class DocTerms(filename:String, terms:Map[String,Int])

def study(filename:String):DocTerms = {
    val counts = (sc
    .textFile(filename)
    .flatMap(tokenize)
    .map( (s:String) => (s,1) )
    .reduceByKey( _ + _ )
    .collectAsMap()
    )
    DocTerms(filename, counts)
}

val book1 = study("/data/warandpeace.txt")

for (c <- book1.terms.slice(0, 20)) println(c)

Output:

(worried,12)
(matthew.,1)
(follow,32)
(lost--than,1)
(diseases,1)
(reports.,1)
(scoundrel?,1)
(but--i,1)
(road--is,2)
(well-garnished,1)
(napoleon;,2)
(passion,,2)
(nataly,2)
(entreating,2)
(sounding,1)
(any?,1)
("sila,1)
(can,",3)
(motionless,22)

Note that this output is not sorted, and Map types in general are not sortable, but they are fast for lookups and dictionary-like. Although only 20 elements were printed, all terms were counted and stored in the book1 object, which has type DocTerms.
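For example, looking up a single term's count in the Map is direct (getOrElse returns a default for terms that never occurred):

book1.terms.getOrElse("prince", 0)   // count of "prince", or 0 if it never appeared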

Method 2

Alternatively, the terms portion of DocTerms could be made a List[(String,Int)] and sorted (at some computation cost) before being returned, so that the most numerous terms appear first. That means it will not be a Map or a fast-lookup dictionary, but for some uses a list-like type might be preferable.

import scala.collection.immutable.WrappedString;

def tokenize(line:String):Array[String] =
    new WrappedString(line).toLowerCase().split(' ')

case class DocTerms(filename:String, terms:List[(String,Int)])

def study(filename:String):DocTerms = {
    val counts = (sc
        .textFile(filename)
        .flatMap(tokenize)
        .map( (s:String) => (s,1) )
        .reduceByKey( _ + _ )
        .sortBy[Int]( (pair:Tuple2[String,Int]) => -pair._2 )
        .collect()
        )
    DocTerms(filename, counts.toList)
}

val book1 = study("/data/warandpeace.txt")

for(c<-book1.terms.slice(1,100)) println(c)

Output:

(and,21403)
(to,16502)
(of,14903)
(,13598)
(a,10413)
(he,9296)
(in,8607)
(his,7932)
(that,7417)
(was,7202)
(with,5649)
(had,5334)
(at,4495)
(not,4481)
(her,3963)
(as,3913)
(it,3898)
(on,3666)
(but,3619)
(for,3390)
(i,3226)
(she,3225)
(is,3037)
(him,2733)
(you,2681)
(from,2661)
(all,2495)
(said,2406)
(were,2356)
(by,2354)
(be,2316)
(they,2062)
(who,1939)
(what,1935)
(which,1932)
(have,1931)
(one,1855)
(this,1836)
(prince,1705)
(an,1617)
(so,1577)
(or,1553)
(been,1460)
(their,1435)
(did,1428)
(when,1426)
(would,1339)
(up,1296)
(pierre,1261)
(only,1250)
(are,1183)
(if,1165)
(my,1135)
(could,1095)
(there,1094)
(no,1057)
(out,1048)
(into,998)
(now,957)
(will,954)
(them,942)
(more,939)
(about,919)
(went,846)
(how,841)
(we,838)
(some,826)
(him.,826)
(after,817)
(do,814)
(man,778)
(old,773)
(your,765)
(very,762)
("i,755)
(chapter,731)
(princess,726)
(him,,716)
(then,706)
(andrew,700)
(like,691)
(himself,687)
(natasha,683)
(has,677)
(french,671)
(without,665)
(came,662)
(before,658)
(me,657)
(began,654)
(looked,642)
(time,641)
(those,639)
(know,623)
(still,612)
(our,610)
(face,609)
(thought,608)
(see,605)

You might notice that the most common words are not very interesting. But we also have words like "prince", "princess", "andrew", "natasha", and "french", which are probably more specific to War and Peace.

To reduce the weight of common words once you have a bunch of documents, people often scale counts with TFIDF, or "term frequency-inverse document frequency": each term's count is basically divided by the number of documents in the corpus in which it appears (or some similar function involving logarithms). But that's a topic for another question.
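As a pointer, Spark's MLlib ships a HashingTF/IDF pair that implements this kind of weighting. A minimal sketch, assuming one token sequence per document (the "/data/docs" path and the whitespace tokenization are placeholders):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// One Seq[String] of tokens per document; "/data/docs" is a placeholder path.
val docs: RDD[Seq[String]] = sc
    .wholeTextFiles("/data/docs")
    .map { case (_, content) => content.toLowerCase.split("\\s+").toSeq }

val tf: RDD[Vector] = new HashingTF().transform(docs)
tf.cache()                                 // IDF().fit makes a pass over the data
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf) // TFIDF-weighted term vectors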
