
HDFS - load a massive amount of files

For testing purposes I'm trying to load a massive amount of small files into HDFS. We are talking about 1 million (1'000'000) files, each between 1KB and 100KB in size. I generated those files with an R script on a Linux system, all in one folder. Every file has an information structure containing a header with product information and a varying number of columns with numeric data.

The problem is when I try to upload those local files into HDFS with the command:

hdfs dfs -copyFromLocal /home/user/Documents/smallData /

Then I get one of the following Java heap space errors:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

I use the Cloudera CDH5 distribution with a Java heap size of about 5 GB. Is there another way than increasing this heap size even more? Maybe a better way to load this massive amount of data into HDFS?

I'm very thankful for every helpful comment!

Even if you increase the memory and manage to store the files in HDFS, you will get many problems later, at processing time.

Problems with small files and HDFS

A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you're storing small files, then you probably have lots of them (otherwise you wouldn't turn to Hadoop), and the problem is that HDFS can't handle lots of files.

Every file, directory and block in HDFS is represented as an object in the namenode's memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
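As a rough calculation: 10,000,000 files × 2 namenode objects each (one file object and one block object) × 150 bytes ≈ 3 × 10⁹ bytes ≈ 3 GB of namenode heap.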

Furthermore, HDFS is not geared up for efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.

Problems with small files and MapReduce

Map tasks usually process a block of input at a time (using the default FileInputFormat). If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.

There are a couple of features to help alleviate the bookkeeping overhead: task JVM reuse for running multiple map tasks in one JVM, thereby avoiding some JVM startup overhead (see the mapred.job.reuse.jvm.num.tasks property), and MultiFileInputSplit which can run more than one split per map.
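A minimal sketch of turning on JVM reuse (old MR1 API, using the property named above; -1 means the JVM is reused without limit). This is illustrative, not code from the original answer:

JvmReuseExample.java

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // -1 = no limit: one JVM is reused for as many map tasks of the job as possible,
        // avoiding the JVM startup cost for every single small-file task.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        // ... set input/output formats, mapper, reducer, paths, then submit the job
    }
}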

Solution

Hadoop Archives (HAR files)

Create a .HAR file

Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode's memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files:

hadoop archive -archiveName name -p <parent> <src>* <dest> 
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
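Files inside the archive are then addressed through the har:// scheme (for example, hdfs dfs -ls har:///user/zoo/foo.har). A minimal sketch of doing the same listing programmatically, assuming the archive from the command above actually exists at /user/zoo/foo.har on the default filesystem:

ListHarContents.java

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHarContents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The "har" scheme is served by HarFileSystem, layered on top of HDFS.
        FileSystem harFs = FileSystem.get(URI.create("har:///user/zoo/foo.har"), conf);
        for (FileStatus status : harFs.listStatus(new Path("har:///user/zoo/foo.har"))) {
            System.out.println(status.getPath());
        }
    }
}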

Sequence Files

The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).
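A minimal sketch of such a program, packing a local directory of small files into one block-compressed SequenceFile on HDFS (paths and class name are illustrative, not from the original post):

SmallFilesToSequenceFile.java

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("/smallData.seq");

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                // block compression: compresses batches of records together
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {

            // Key = original filename, value = raw file bytes.
            for (File f : new File("/home/user/Documents/smallData").listFiles()) {
                byte[] content = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        }
    }
}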

HBase

If you are producing lots of small files, then, depending on the access pattern, a different type of storage might be more appropriate. HBase stores data in MapFiles (indexed SequenceFiles), and is a good choice if you need to do MapReduce-style streaming analyses with the occasional random lookup. If latency is an issue, then there are lots of other choices.

Try to increase HADOOP_HEAPSIZE

HADOOP_HEAPSIZE=2048 hdfs dfs -copyFromLocal /home/user/Documents/smallData /


First of all: if this isn't a stress test on your namenode, it's ill-advised to do this, but I assume you know what you are doing. (Expect slow progress on this.)

If the objective is to just get the files onto HDFS, try doing this in smaller batches or set a higher heap size on your hadoop client.

You do this, as rpc1 mentioned in his answer, by prefixing HADOOP_HEAPSIZE=<mem in MB here> to your hadoop fs -put command.

The Hadoop Distributed File System is not good with many small files, but it is with many big files. HDFS keeps a record in a lookup table that points to every file/block in HDFS, and this lookup table is usually loaded into memory. So you should not just increase the Java heap size of the client but also increase the heap size of the namenode inside hadoop-env.sh; these are the defaults:

export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_INIT_HEAPSIZE="1000"

If you are going to do processing on those files, you should expect low performance on the first MapReduce job you run on them (Hadoop creates as many map tasks as there are files/blocks, and this will overload your system unless you use a combine input format). I advise you to either merge the files into big files (64MB/128MB) or use another data source (not HDFS).
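A minimal sketch of the combine-input-format approach using CombineTextInputFormat from the new MapReduce API, so that many small files are grouped into a few large input splits instead of one map task per file (job name and paths are illustrative):

CombineSmallFilesJob.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesJob.class);

        // Group small files into splits of up to 128 MB each.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path("/smallData"));
        FileOutputFormat.setOutputPath(job, new Path("/smallDataOut"));

        // ... set mapper/reducer classes here, then:
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}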

To solve this problem, I build a single file with a custom format. The content of the file is all the small files concatenated. The format looks like this:

<DOC>
  <DOCID>1</DOCID>
  <DOCNAME>Filename</DOCNAME>
  <DOCCONTENT>
    Content of file 1
  </DOCCONTENT>
</DOC>

This structure could have more or fewer fields, but the idea is the same. For example, I have used this structure:

<DOC>
  <DOCID>1</DOCID>
  Content of file 1
</DOC>

And it handled more than six million files.

If you want each file to be processed by a single map task, you can delete the \n characters between the <DOC> and </DOC> tags so that each document occupies one line. After this, you only have to parse the structure to get the document identifier and the content.
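A minimal sketch of that parsing step (my own illustration, not code from the answer): once each <DOC>...</DOC> record sits on a single line, a plain mapper over TextInputFormat can recover the document id and content with a simple regex.

DocRecordParser.java

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DocRecordParser {
    // Matches one flattened record: <DOC><DOCID>id</DOCID>content</DOC>
    private static final Pattern DOC =
            Pattern.compile("<DOC>\\s*<DOCID>(.*?)</DOCID>(.*?)</DOC>");

    public static void main(String[] args) {
        String line = "<DOC><DOCID>1</DOCID>Content of file 1</DOC>";
        Matcher m = DOC.matcher(line);
        if (m.find()) {
            String docId = m.group(1);    // "1"
            String content = m.group(2);  // "Content of file 1"
            System.out.println(docId + " -> " + content.trim());
        }
    }
}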
