File type detection in Java without I/O

Question

There is a built-in method in the Java JDK that detects file types:

Files.probeContentType(Paths.get("/temp/word.doc"));

The javadoc says that a FileTypeDetector may examine the filename, or it may examine a few bytes in the file, which means that it would have to actually try to pull the file from a URL.

This is unacceptable in our app; the content of the file is available only through an InputStream.

I tried to step through the code to see what the JDK is actually doing, but it seems that it goes to FileTypeDetectors.defaultFileTypeDetector.probeContentType(path), which goes to sun.nio.fs.AbstractFileTypeDetector, and I couldn't step into that code because there's no source attachment.

How do I use JDK file type detection and force it to use file content that I supply, rather than having it go out and perform I/O on its own?

Answer 1

The docs for Files.probeContentType() explain how to plug in your own FileTypeDetector implementation, but if you follow the docs you'll find that there is no reliable way to ensure that your implementation is the one that is selected (the idea is that different implementations serve as fallbacks for each other, not alternatives). There is certainly no documented way to prevent the built-in implementation from ever reading the target file.
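
As a rough illustration of the plug-in point the docs describe, a detector that looks only at the file name might look like the sketch below. NameOnlyDetector and its two hard-coded extensions are made up for this example. Calling it directly, rather than through Files.probeContentType(), sidesteps the selection problem, because no other detector gets a chance to open the file.

```java
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;
import java.util.Locale;

// Hypothetical detector: decides from the file name alone and never opens the file.
class NameOnlyDetector extends FileTypeDetector {
    @Override
    public String probeContentType(Path path) {
        Path fileName = path.getFileName();
        if (fileName == null) {
            return null; // e.g. a root path has no file name
        }
        String name = fileName.toString().toLowerCase(Locale.ROOT);
        // Illustrative mappings only; a real table would be much larger.
        if (name.endsWith(".doc")) {
            return "application/msword";
        }
        if (name.endsWith(".pdf")) {
            return "application/pdf";
        }
        return null; // unknown: the caller decides what to do next
    }
}
```

Called directly, new NameOnlyDetector().probeContentType(Paths.get("/temp/word.doc")) returns "application/msword" without reading anything from disk.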

You can surely find a map of common filename extensions to content types in various places around the web and probably on your own system; mime.types is a common name for such files. If you want to rely only on such a mapping file then you probably need to use your own custom facility, not the Java standard library's.
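
A custom facility of that kind can be quite small. The sketch below assumes the common mime.types layout (one content type per line followed by its extensions, with # starting a comment) and takes the file's text as a String, so it performs no I/O itself. MimeTypesTable and its method names are invented for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical extension-to-content-type table built from mime.types style text.
class MimeTypesTable {
    private final Map<String, String> byExtension = new HashMap<>();

    // Parses lines of the form "type/subtype ext1 ext2 ..."; '#' starts a comment.
    static MimeTypesTable parse(String text) {
        MimeTypesTable table = new MimeTypesTable();
        for (String line : text.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            String[] fields = line.trim().split("\\s+");
            // fields[0] is the content type; the remaining fields are extensions.
            for (int i = 1; i < fields.length; i++) {
                // The first mapping seen for an extension wins.
                table.byExtension.putIfAbsent(fields[i].toLowerCase(Locale.ROOT), fields[0]);
            }
        }
        return table;
    }

    // Returns the content type for the file name's extension, or null if unknown.
    String lookup(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null; // no extension to look up
        }
        return byExtension.get(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}
```

For example, MimeTypesTable.parse("application/msword doc dot").lookup("word.doc") yields "application/msword". The lookup is only as trustworthy as the file name, which is the trade-off of never looking at the content.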

Answer 2

The JDK's Files.probeContentType() simply loads the FileTypeDetector implementations available in your JDK installation and asks them to detect the MIME type. If none of them can determine the type, it returns null.

Apache has a library called Tika which does exactly what you want: it determines the MIME type of the content you give it. It can also be plugged into your JDK so that Files.probeContentType() uses Tika. Check this tutorial for quick code: http://wilddiary.com/detect-file-type-from-content/

Answer 3

If you are worried about consuming the contents of an InputStream, you can wrap it in a PushbackInputStream and "unread" those bytes so that the next detector implementation can read them.

A binary file's magic number is usually 4 bytes, so new PushbackInputStream(in, 4) should be sufficient.

PushbackInputStream pushbackStream = new PushbackInputStream(in, 4);
byte[] magicNumber = new byte[4];
// read() may return fewer than 4 bytes, or -1 at end of stream;
// production code should loop until the array is full or the stream ends
int bytesRead = pushbackStream.read(magicNumber);

// now figure out the content type based on the magic number
ContentType type = ...
// now push back the bytes that were read so you can read the whole stream
if (bytesRead > 0) {
    pushbackStream.unread(magicNumber, 0, bytesRead);
}

// now your downstream process can read the pushbackStream as a
// normal InputStream and gets those magic number bytes back
...
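
A fuller version of the same idea, as a sketch rather than a finished detector: it loops until 4 bytes are read or the stream ends, pushes back exactly what it read, and compares against a few well-known signatures (PDF, PNG, ZIP). MagicSniffer, its method names, and the choice of signatures are assumptions made for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper: peeks at the first bytes of a stream, then restores them.
class MagicSniffer {
    private static final int MAGIC_LENGTH = 4;

    // Wraps the caller's stream with a pushback buffer large enough for the peek.
    static PushbackInputStream wrap(InputStream in) {
        return new PushbackInputStream(in, MAGIC_LENGTH);
    }

    // Returns a content type guess, or null if the signature is unknown or too short.
    static String sniff(PushbackInputStream in) throws IOException {
        byte[] magic = new byte[MAGIC_LENGTH];
        int total = 0;
        while (total < MAGIC_LENGTH) {
            int n = in.read(magic, total, MAGIC_LENGTH - total);
            if (n < 0) {
                break; // end of stream before 4 bytes were available
            }
            total += n;
        }
        if (total > 0) {
            in.unread(magic, 0, total); // restore exactly the bytes that were consumed
        }
        if (total < MAGIC_LENGTH) {
            return null;
        }
        if (matches(magic, 0x25, 0x50, 0x44, 0x46)) return "application/pdf"; // "%PDF"
        if (matches(magic, 0x89, 0x50, 0x4E, 0x47)) return "image/png";       // "\x89PNG"
        if (matches(magic, 0x50, 0x4B, 0x03, 0x04)) return "application/zip"; // "PK\x03\x04"
        return null;
    }

    private static boolean matches(byte[] actual, int... expected) {
        for (int i = 0; i < expected.length; i++) {
            if ((actual[i] & 0xFF) != expected[i]) {
                return false;
            }
        }
        return true;
    }
}
```

After sniff() returns, the wrapped stream can be handed to whatever reads the file next, and it will see the content from the first byte. Note that a ZIP signature also matches formats built on ZIP (such as .docx or .jar), so 4 bytes are not always enough to tell those apart.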

3 answers

solution1
2 2015-01-27 19:07:50

solution2
1 2015-02-22 14:42:35

solution3
0 2015-01-28 04:23:42
