
How can I tell whether a large file is already in my HDFS cluster?

I have a large jar package to upload to my HDFS cluster, but I don't want to upload it a second time if it is already there, so I need a way to check whether the jar in HDFS is the same as my local copy. I want to use a checksum to solve this problem. My code looks like this:

// src is the Path of the local jar, dst is the target Path on HDFS
val fs = FileSystem.get(conf)
val lfs = FileSystem.getLocal(conf)
val localchecksum = lfs.getFileChecksum(src)
val hdfschecksum = fs.getFileChecksum(dst)
if (!localchecksum.equals(hdfschecksum)) {
  // upload the jar file
}

Unfortunately, LocalFileSystem does not implement getFileChecksum and returns null by default, so this code does not work. How can I tell whether the jar file is already in the HDFS cluster? Any method is welcome.

Why not compute your own MD5 checksum? Open the file from HDFS and compute the checksum (your own version), then open the local file, compute its checksum, and compare the two.

Here is code to do it, adapted from another Stack Overflow question:

MessageDigest md = MessageDigest.getInstance("MD5");
try (InputStream is = Files.newInputStream(Paths.get("file.txt"));
     DigestInputStream dis = new DigestInputStream(is, md)) {
  // Read the stream to EOF; the digest is updated as the bytes pass through
  byte[] buffer = new byte[8192];
  while (dis.read(buffer) != -1) { /* keep reading */ }
}
byte[] digest = md.digest(); // 16-byte MD5 of the file contents

md5 checksum in java
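
To make that concrete, here is a minimal sketch of the comparison using the Hadoop Java client. The class name, paths, and the md5Of helper are invented for illustration; note that hashing the HDFS copy means streaming its full contents over the network, which can be slow for a very large jar.

import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JarUploadCheck {

    // Compute the MD5 of any InputStream by reading it to EOF.
    static byte[] md5Of(InputStream in) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestInputStream dis = new DigestInputStream(in, md)) {
            byte[] buffer = new byte[8192];
            while (dis.read(buffer) != -1) { /* just consume the stream */ }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);        // HDFS
        FileSystem lfs = FileSystem.getLocal(conf);  // local file system

        Path local = new Path("/tmp/myapp.jar");     // hypothetical local jar
        Path remote = new Path("/apps/myapp.jar");   // hypothetical HDFS target

        // Upload only if the file is missing on HDFS or its bytes differ.
        boolean needUpload = !fs.exists(remote)
                || !Arrays.equals(md5Of(lfs.open(local)), md5Of(fs.open(remote)));
        if (needUpload) {
            fs.copyFromLocalFile(local, remote);
        }
    }
}

A cheap pre-check before hashing anything is to compare fs.getFileStatus(remote).getLen() with the local file length; if the sizes differ, the files cannot be identical.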

The HDFS checksum is relatively simple to implement yourself. Here is the source code for it: DFSClient.java:703. All the complexity in that code comes from pulling the blocks of the file from different datanodes and dealing with errors. To calculate it on the local filesystem, you just need to chop the file into blocks, calculate the CRC of each block, collect all the CRCs together, and compute the MD5 of the result. Just make sure to use the same block size as your HDFS.
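
As a rough, untested sketch of that idea for a local file: HDFS's file checksum (MD5MD5CRC32FileChecksum) is built from a CRC per fixed-size chunk (dfs.bytes-per-checksum, 512 bytes by default), an MD5 over each block's CRCs, and a final MD5 over the per-block MD5s. The chunk size, block size, and CRC variant (CRC32 vs. CRC32C, controlled by dfs.checksum.type) all have to match the cluster's settings, and the real DFSClient code has further details, so treat this as an illustration of the algorithm rather than a guaranteed byte-for-byte match for fs.getFileChecksum.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.zip.CRC32;

public class LocalHdfsStyleChecksum {

    // Sketch of an MD5-of-MD5-of-CRC32 style checksum for a local file.
    // bytesPerCrc and blockSize should match the cluster's
    // dfs.bytes-per-checksum and dfs.blocksize settings.
    static byte[] md5Md5Crc32(String file, int bytesPerCrc, long blockSize) throws Exception {
        MessageDigest fileMd5 = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(Paths.get(file))) {
            ByteArrayOutputStream blockCrcs = new ByteArrayOutputStream();
            DataOutputStream crcOut = new DataOutputStream(blockCrcs);
            byte[] chunk = new byte[bytesPerCrc];
            long bytesInBlock = 0;
            int n;
            while ((n = readFully(in, chunk)) > 0) {
                CRC32 crc = new CRC32();                  // the cluster may use CRC32C instead
                crc.update(chunk, 0, n);
                crcOut.writeInt((int) crc.getValue());    // collect this chunk's CRC
                bytesInBlock += n;
                if (bytesInBlock >= blockSize) {          // block boundary: fold the CRCs into an MD5
                    fileMd5.update(MessageDigest.getInstance("MD5").digest(blockCrcs.toByteArray()));
                    blockCrcs.reset();
                    bytesInBlock = 0;
                }
            }
            if (blockCrcs.size() > 0) {                   // last, partial block
                fileMd5.update(MessageDigest.getInstance("MD5").digest(blockCrcs.toByteArray()));
            }
        }
        return fileMd5.digest();                          // final MD5 over the per-block MD5s
    }

    // Read up to buf.length bytes, looping so a chunk is only short at EOF.
    static int readFully(InputStream in, byte[] buf) throws Exception {
        int total = 0;
        while (total < buf.length) {
            int read = in.read(buf, total, buf.length - total);
            if (read == -1) break;
            total += read;
        }
        return total;
    }
}

If the two sides really are comparable, the payoff is that you can decide whether to upload without ever streaming the jar back out of HDFS.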
