
Writing to a file in HDFS in Hadoop

I was looking for a disk-intensive Hadoop application to test I/O activity in Hadoop, but I couldn't find one that keeps disk utilization above, say, 50%, or that otherwise keeps the disks busy. I tried randomwriter, but surprisingly it is not disk-I/O intensive.

So, I wrote a tiny program that creates a file in the Mapper and writes some text into it. The application works, but utilization is high only on the master node, which is also the name node, job tracker, and one of the slaves. Disk utilization is nil or negligible on the other task trackers. I can't understand why disk I/O is so low on the task trackers. Could anyone please nudge me in the right direction if I'm doing something wrong? Thanks in advance.

Here is the sample code segment I added to the WordCount.java mapper to create a file and write a UTF string into it:

// Get a handle to HDFS from the job configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outFile;
while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
    // Create a throwaway file per task attempt, write a short UTF string, then delete it
    outFile = new Path("./dummy" + context.getTaskAttemptID());
    FSDataOutputStream out = fs.create(outFile);
    out.writeUTF("helloworld");
    out.close();
    fs.delete(outFile, false);   // non-recursive delete; fs.delete(Path) alone is deprecated
}

I think any mechanism that creates Java objects per cell in each row and serializes those objects before saving them to disk has little chance of saturating the disk.
In my experience, serialization runs at several MB per second, or a bit more, but not at 100 MB per second.
So bypassing the Hadoop layers on the output path, as you did, is quite right. Now let's consider how a write to HDFS works. The data is written to the local disk via the local datanode, and then synchronously replicated to other nodes over the network, depending on your replication factor. In this case you cannot write more data into HDFS than your network bandwidth allows. If your cluster is relatively small, things get worse: with a 3-node cluster and triple replication, you push all the data to all nodes, so the whole cluster's HDFS write bandwidth is about 1 Gbit, if that is the network you have.
So, I would suggest:
a) Reducing the replication factor to 1, so you stop being bound by the network.
b) Writing bigger chunks of data in each call from the mapper (see the sketch below).
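Here is a minimal sketch of both suggestions combined, meant to drop into the same mapper as the code in the question. The 4 KB I/O buffer, 64 MB block size, and ~64 MB of dummy data are illustrative choices of mine, not anything from the original post; FileSystem.create lets you pass a per-file replication factor, so you don't have to change dfs.replication cluster-wide.

// Sketch of suggestions (a) and (b): replication factor 1 and one big sequential write.
// Buffer size, block size, and the 64 MB total are illustrative, not from the original post.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outFile = new Path("./dummy" + context.getTaskAttemptID());

// create(path, overwrite, ioBufferSize, replication, blockSize)
// replication = (short) 1 keeps the write on the local datanode instead of the replication pipeline
FSDataOutputStream out = fs.create(outFile, true, 4096, (short) 1, 64L * 1024 * 1024);

byte[] chunk = new byte[1024 * 1024];      // 1 MB per write() call
java.util.Arrays.fill(chunk, (byte) 'x');
for (int i = 0; i < 64; i++) {             // ~64 MB total per map task
    out.write(chunk);
}
out.close();
fs.delete(outFile, false);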

OK, I must have been really careless for not checking earlier. The actual problem was that my data nodes were not actually running. I reformatted the namenode and everything fell back into place; I was getting a utilization of 15-20%, which is not bad for WordCount. I will run TestDFSIO next and see whether I can push disk utilization even higher.
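For reference, a quick way to confirm that the datanodes are actually live is `hdfs dfsadmin -report` (or `hadoop dfsadmin -report` on older releases), and TestDFSIO can be launched roughly as below. The exact jar location and version string vary with the Hadoop distribution, so treat this as a sketch rather than copy-paste commands.

# Check how many datanodes the namenode sees as live
hdfs dfsadmin -report

# TestDFSIO write benchmark: 10 files of 1000 MB each (jar path varies by install)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    TestDFSIO -write -nrFiles 10 -fileSize 1000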
