
Hadoop word count

In Hadoop's word count example, the map function writes out each word with a count of one as the intermediate result, and the reduce function does the summing. Why not use a HashMap in the map function, with the word as the key and the count as the value? If a word occurs more than once in one file split, its count is incremented, and at the end of the map function the accumulated results are written out.
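A minimal sketch of that in-mapper aggregation, assuming the standard org.apache.hadoop.mapreduce API (the class and variable names here are illustrative, not from the Hadoop examples): counts accumulate in a HashMap during map() and are emitted once in cleanup(), which Hadoop calls at the end of the split.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: accumulate per-word counts in a HashMap
// and emit them once, when the mapper finishes its split.
public class InMapperWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit the aggregated counts once per split.
        Text outKey = new Text();
        IntWritable outValue = new IntWritable();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            outKey.set(e.getKey());
            outValue.set(e.getValue());
            context.write(outKey, outValue);
        }
    }
}
```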

Done this way, it should be more efficient than the original design without a combiner; with a combiner, the efficiency should be roughly equal.
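For comparison, this is roughly how a combiner is wired into the classic word count driver. TokenizerMapper and IntSumReducer are assumed here to be the mapper and reducer classes from Hadoop's standard WordCount example; because summing is associative and commutative, the reducer class can be reused directly as the combiner.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count driver with a combiner: map output is pre-aggregated
// on each mapper's node before the shuffle, cutting network traffic.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);  // assumed: the standard example's mapper
        job.setCombinerClass(IntSumReducer.class);  // same class doubles as the combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```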

Any advice?

Yes, you can use a HashMap as well, but you need to consider worst-case scenarios when designing your solution.

Normally a block is 128 MB. Consider input made up of short words with few or no repetitions: you will have many distinct words, so the number of entries in the HashMap grows and it consumes much more memory. Keep in mind that many different jobs may be running on the same data node, so a HashMap eating a large amount of RAM will eventually slow those other jobs down as well. Also, as the HashMap grows it has to rehash, which adds time to your job execution. A bounded variant is sketched below.
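One common mitigation, sketched here under the same assumptions as the mapper above (MAX_ENTRIES is an illustrative threshold, not a Hadoop setting), is to cap the HashMap and flush partial sums early. Correctness is preserved because the reducer sums whatever partial counts it receives for a word.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Memory-bounded in-mapper combining: flush partial counts whenever
// the map holds too many distinct words, so RAM use stays predictable
// even on splits full of short, rarely repeated words.
public class BoundedInMapperWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Illustrative cap on distinct words held in memory at once.
    private static final int MAX_ENTRIES = 100_000;

    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        if (counts.size() > MAX_ENTRIES) {
            flush(context); // emit partial sums; the reducer adds them up
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        flush(context); // emit whatever remains at the end of the split
    }

    private void flush(Context context)
            throws IOException, InterruptedException {
        Text outKey = new Text();
        IntWritable outValue = new IntWritable();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            outKey.set(e.getKey());
            outValue.set(e.getValue());
            context.write(outKey, outValue);
        }
        counts.clear();
    }
}
```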

I know this is an old post, but for anyone looking for Hadoop help in the future, this question may serve as another reference: Hadoop word count: receiving the total number of words that start with the letter "c"
