Hadoop Streaming Memory Usage

I'm wondering where the memory is used in the following job:

  • Hadoop Mapper/Reducer Heap Size: -Xmx2G
  • Streaming API:

    • Mapper: /bin/cat
    • Reducer: wc
  • Input file is a 350 MB file containing a single line full of a's.

This is a simplified version of the real problem we've encountered.
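
For anyone trying to reproduce this, a rough sketch of how such an input file could be generated and loaded into HDFS (the file name and HDFS path are made up):

    # Create a ~350 MB file consisting of a single line of 'a' characters,
    # then copy it into HDFS (local and HDFS paths are hypothetical).
    head -c 350000000 /dev/zero | tr '\0' 'a' > one-line.txt
    printf '\n' >> one-line.txt
    hadoop fs -put one-line.txt /tmp/one-line.txt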

Reading the file from HDFS and constructing a Text object should not take more than 700 MB of heap, assuming Text uses 16 bits per character. I'm not sure about that; I could imagine that Text only uses 8 bits.

So there is this (worst-case) 700 MB line. The line should fit at least twice into the heap, but I always get out-of-memory errors.

Is this a possible bug in Hadoop (e.g. unnecessary copies), or do I just not understand some required memory-intensive steps?

I would be really thankful for any further hints.

The memory given to each child JVM running a task can be changed by setting the mapred.child.java.opts property. The default setting is -Xmx200m, which gives each task 200 MB of memory.
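
For a streaming job you can override this per job by passing a generic -D option before the streaming-specific options. A minimal sketch, assuming a typical Hadoop 1.x layout for the streaming jar and the hypothetical paths from above:

    # Give each map/reduce task a 2 GB heap instead of the 200 MB default
    # (the jar path and the input/output paths are examples; adjust to your setup).
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -D mapred.child.java.opts=-Xmx2048m \
      -input /tmp/one-line.txt \
      -output /tmp/wc-out \
      -mapper /bin/cat \
      -reducer wc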

When you say:

Input file is a 350 MB file containing a single line full of a's.

I'm assuming your file has a single line of all a's with a single newline delimiter.

If that is taken up as the value in the map(key, value) function, I think you might have memory issues, since your task can use only 200 MB and you have a single record in memory that is 350 MB.
