
Storing the contents of a file in an immutable Map in Scala

I am trying to implement a simple word count in Scala using an immutable map (this is intentional), and the way I am trying to accomplish it is as follows:

  1. Create an empty immutable map.
  2. Create a scanner that reads through the file.
  3. While scanner.hasNext() is true:

    • Check if the map contains the word; if it doesn't, initialize the count to zero.
    • Create a new entry with key = word and value = count + 1.
    • Update the map.
  4. At the end of the iteration, the map is populated with all the counts.

My code is as follows:

val wordMap = Map.empty[String,Int]
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
  val token = input.next()
  val currentCount = wordMap.getOrElse(token,0) + 1
  val wordMap = wordMap + (token,currentCount)
}

The idea is that wordMap will have all the word counts at the end of the iteration. Whenever I try to compile this snippet, I get the following error:

recursive value wordMap needs type. 递归值wordMap需要类型。

Can somebody point out why I am getting this error and what I can do to remedy it?

Thanks

val wordMap = wordMap + (token,currentCount)

This line is redefining an already-defined value. If you want to do this, you need to define wordMap with var and then just use

wordMap = wordMap + (token,currentCount)

Though how about this instead?

io.Source.fromFile("textfile.txt")            // read from the file
  .getLines.flatMap{ line =>                  // for each line
     line.split("\\s+")                       // split the line into tokens
       .groupBy(identity).mapValues(_.size)   // count each token in the line
  }                                           // this produces an iterator of token counts
  .toStream                                   // make a Stream so we can groupBy
  .groupBy(_._1).mapValues(_.map(_._2).sum)   // combine all the per-line counts
  .toList

Note that the per-line pre-aggregation is used to try to reduce the memory required; counting across the entire file at once might be too big.

If your file is really massive, I would suggest doing this in parallel (since word counting is trivial to parallelize) using either Scala's parallel collections or Hadoop (via one of the cool Scala Hadoop wrappers like Scrunch or Scoobi).
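For the parallel-collections route, here is one possible sketch (an assumption on my part, not the answerer's code: on Scala 2.13+ the CollectionConverters import is needed because parallel collections live in the separate scala-parallel-collections module; on 2.12 and earlier .par is built in):

```scala
import scala.collection.parallel.CollectionConverters._ // Scala 2.13+ only

// Parallel word count sketch: count each line independently in
// parallel, then merge the partial per-line maps into one total.
def parWordCount(lines: Vector[String]): Map[String, Int] =
  lines.par
    .map(_.split("\\s+").filter(_.nonEmpty)            // tokenize each line
      .groupBy(identity)
      .map { case (w, ws) => w -> ws.length })         // per-line counts
    .fold(Map.empty[String, Int]) { (a, b) =>          // merge partial maps
      b.foldLeft(a) { case (acc, (w, n)) =>
        acc + (w -> (acc.getOrElse(w, 0) + n))
      }
    }
```

The merge function is associative and the empty map is its identity, which is exactly what the parallel fold requires.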

EDIT: Detailed explanation:

Ok, first look at the inner part of the flatMap. We take a string and split it apart on whitespace:

val line = "a b c b"
val tokens = line.split("\\s+") // Array(a, b, c, b)

Now identity is a function that just returns its argument, so if we groupBy(identity), we map each distinct word type to all of its word tokens:

val grouped = tokens.groupBy(identity) // Map(c -> Array(c), a -> Array(a), b -> Array(b, b))

And finally, we want to count up the number of tokens for each type:

val counts = grouped.mapValues(_.size) // Map(c -> 1, a -> 1, b -> 2)

Since we map this over all the lines in the file, we end up with token counts for each line.

So what does flatMap do? Well, it runs the token-counting function over each line, and then combines all the results into one big collection.
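A tiny standalone illustration of the difference between map and flatMap here:

```scala
val lines = List("a b", "b c")

// map keeps one result per line; flatMap flattens them into one collection
val mapped = lines.map(_.split(" ").toList)     // List(List(a, b), List(b, c))
val flat   = lines.flatMap(_.split(" ").toList) // List(a, b, b, c)
```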

Assume the file is:

a b c b
b c d d d
e f c

Then we get:

val countsByLine = 
  io.Source.fromFile("textfile.txt")            // read from the file
    .getLines.flatMap{ line =>                  // for each line
       line.split("\\s+")                       // split the line into tokens
         .groupBy(identity).mapValues(_.size)   // count each token in the line
    }                                           // this produces an iterator of token counts
println(countsByLine.toList) // List((c,1), (a,1), (b,2), (c,1), (d,3), (b,1), (c,1), (e,1), (f,1))

So now we need to combine the counts of each line into one big set of counts. The countsByLine variable is an Iterator, so it doesn't have a groupBy method. Instead we can convert it to a Stream, which is basically a lazy list. We want laziness because we don't want to have to read the entire file into memory before we start. Then the groupBy groups all counts of the same word type together.

val groupedCounts = countsByLine.toStream.groupBy(_._1)
println(groupedCounts.mapValues(_.toList)) // Map(e -> List((e,1)), f -> List((f,1)), a -> List((a,1)), b -> List((b,2), (b,1)), c -> List((c,1), (c,1), (c,1)), d -> List((d,3)))

And finally, we can sum up the counts from each line for each word type by grabbing the second item from each tuple (the count) and summing:

val totalCounts = groupedCounts.mapValues(_.map(_._2).sum)
println(totalCounts.toList) // List((e,1), (f,1), (a,1), (b,3), (c,3), (d,3))

And there you have it.

You have a few mistakes: you've defined wordMap twice (val is used to declare a value). Also, Map is immutable, so you either have to declare it as a var or use a mutable map (I suggest the former).
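A minimal reproduction of the compiler error, next to the var version that works:

```scala
// val n = n + 1    // does not compile: "recursive value n needs type"
var n = 0           // a var can be rebound
n = n + 1           // fine: this is reassignment, not redefinition
```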

Try this:

var wordMap = Map.empty[String,Int] withDefaultValue 0
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
  val token = input.next()
  wordMap += token -> (wordMap(token) + 1)
}
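Since keeping the map immutable was the original intent, the same loop can also be written without var by folding the tokens into a fresh map at each step (a sketch of mine, not tested against your file; the Scanner is replaced with a plain token iterator for brevity):

```scala
import scala.io.Source

object WordCount {
  // Fold the tokens into an immutable Map: each step builds a new
  // map rather than updating one in place.
  def count(tokens: Iterator[String]): Map[String, Int] =
    tokens.foldLeft(Map.empty[String, Int]) { (counts, token) =>
      // getOrElse plays the role of "initialize the count to zero"
      counts + (token -> (counts.getOrElse(token, 0) + 1))
    }

  def main(args: Array[String]): Unit = {
    val tokens = Source.fromFile("textfile.txt").getLines().flatMap(_.split("\\s+"))
    println(count(tokens))
  }
}
```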
