Java: read the actual file type by reading the first few bytes (Forensic)

Question

Hello, I need a way to read the first four bytes of any file using Java. Why the first four bytes? Because they are a forensic fingerprint of the actual file type (the file extension is not reliable, since it can be falsified).

http://en.wikipedia.org/wiki/File_carving

Now, reading a file this way (Java code below) will read the file "content"; I think it skips the file header information...? I can't get the magic number (first four bytes) and so I'm unable to identify or confirm the true file type of a given specimen.

byte[] buffer = new byte[4];
// try-with-resources closes the stream even if read() throws
try (InputStream is = new FileInputStream("somwhere.in.the.dark")) {
    if (is.read(buffer) != buffer.length) {
        // fewer than four bytes were read: do something
    }
}

Read First 4 Bytes of File

Suggestions, please?

Answer 1

As Blank suggested: https://tika.apache.org

Here's the code. In this example, "test3_iamexe.txt" is an executable whose file extension was renamed to "txt" by an attacker.

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TestTika {

    public static void main(String[] args) {
        File file = new File("C:\\tmp\\test3_iamexe.txt");
        String contentType = null;

        // try-with-resources closes the stream even if parsing fails
        try (InputStream stream = new FileInputStream(file)) {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();

            // This step here is a little expensive: it parses the whole file
            parser.parse(stream, handler, metadata);

            // metadata holds name/value pairs; loop over metadata.names() to see
            // what is available. Content-Type is most likely what you need.
            contentType = metadata.get("Content-Type");
        } catch (IOException | SAXException | TikaException e) {
            // handle it
            e.printStackTrace();
        }

        System.out.println(contentType);
    }
}

Answer 2

I think you can use:

IOUtils.toByteArray(InputStream is)

See IOUtils.toByteArray (Apache Commons IO) to convert your InputStream to a byte array, then take the first 4 bytes.
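Worth noting: a Java InputStream opened on a file starts at byte 0, so nothing is skipped before the magic number. If loading the whole file into memory is more than is needed, the same idea can be done with the JDK alone. The following is only a minimal sketch, assuming Java 8 or later; the class name MagicNumber, its method names, and the tiny signature table are illustrative assumptions, not part of any library, and a real tool would need a far larger table.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not a library class.
final class MagicNumber {

    /** Reads up to the first four bytes of the file; the result is shorter if the file is. */
    static byte[] readHeader(Path path) throws IOException {
        byte[] buffer = new byte[4];
        int total = 0;
        try (InputStream in = Files.newInputStream(path)) {
            // A single read() may return fewer bytes than requested,
            // so keep reading until the buffer is full or the file ends.
            while (total < buffer.length) {
                int n = in.read(buffer, total, buffer.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        }
        return Arrays.copyOf(buffer, total);
    }

    /** Renders bytes as upper-case hex with two digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    /** Maps a hex header to a label; only a few well-known signatures are listed. */
    static String describe(String hex) {
        if (hex.startsWith("4D5A")) return "DOS/Windows executable (MZ)";
        if (hex.startsWith("89504E47")) return "PNG image";
        if (hex.startsWith("25504446")) return "PDF document";
        if (hex.startsWith("504B0304")) return "ZIP archive (also JAR, DOCX, ...)";
        return "unknown";
    }
}
```

Calling MagicNumber.describe(MagicNumber.toHex(MagicNumber.readHeader(path))) on a renamed executable should report the MZ signature regardless of the extension. Some formats need more than four bytes, or bytes at a non-zero offset, to be identified reliably.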

Answer 3

Use the java.nio.file API for that; specifically, write your own FileTypeDetector.

I happen to be doing exactly that in one of my projects:

https://github.com/fge/java7-fs-more/tree/topic/filetypedetector

With this I am able to use Files.probeContentType() and get the exact type of the file back as a MIME string.

See the test file.

Now, how it works:

you write your own implementation of a FileTypeDetector (here is an example to detect PNG files);
you make it return null if the detector can't determine the type;
you register the implementation in META-INF/services/java.nio.file.spi.FileTypeDetector (see here);
test it...
and use Files.probeContentType().
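The steps above can be sketched roughly as follows. This is not the code from the linked project, only an illustration of the same idea under stated assumptions: the class name PngFileTypeDetector is hypothetical, and the class is left package-private here so the snippet stands alone. To register it, it would need to be a public class in its own file, with its fully qualified name written on a line of META-INF/services/java.nio.file.spi.FileTypeDetector.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;
import java.util.Arrays;

// Hypothetical detector: reports "image/png" when a file starts with the PNG signature.
// Make this class public (in its own file) before registering it as a service.
class PngFileTypeDetector extends FileTypeDetector {

    // The eight-byte signature that opens every PNG file.
    private static final byte[] PNG_SIGNATURE = {
        (byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A
    };

    @Override
    public String probeContentType(Path path) throws IOException {
        byte[] header = new byte[PNG_SIGNATURE.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(path)) {
            // read() may deliver fewer bytes than asked for, so loop until full or EOF.
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        }
        // Returning null means "type not recognized", which lets
        // Files.probeContentType() fall through to the other detectors.
        boolean isPng = total == header.length && Arrays.equals(header, PNG_SIGNATURE);
        return isPng ? "image/png" : null;
    }
}
```

Once the detector is registered and on the classpath, Files.probeContentType(path) should consult it for every path it probes; calling probeContentType() directly on an instance, as a unit test would, works without any registration.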


Answer 1: 3, accepted, 2015-05-01 17:38:14

Answer 2: 2, 2015-04-30 10:15:26

Answer 3: 1, 2015-04-30 10:16:02
