
Apache Tika unable to detect Content-Type using file content

I have been trying to detect the MIME type from the file content alone, using the Apache Tika Core and Apache Tika Parser 1.23 jars. This is the code I am using:

Tika tika = new Tika();
File file = new File(filepath);
String mimeType = tika.detect(file);

Tika fails to detect the content type of a file with a .tmp extension (a text/plain file) in the ISO-8859-1 charset, with the content below:

èå èå

A file with the same configuration and the content below is detected properly, though:

00000000000000000000 00000000000000000000 00000000000000000000 00000000000000000000 00000000000000000000 00000000000000000000 PREPREPREPREPREPREPREPREPREêîððåñïîíäåíòñêèå 000000000000000000000000000000000000000000000000000000000000000000000000000000000000来不足000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000来0000000000来解释室

I have tried using the Linux file command to detect the MIME type, and it works as expected. I also tried the Apache Tika App 2.1.0 GUI, but it behaves the same as my code.

Any suggestions on how to detect such files using the content of the file? Thank you in advance.

In the ISO 8859-1 encoding these characters fall in the 8-bit character range, i.e. the extended range 128 to 255. However, whilst the TextDetector counts them, it doesn't include them in the calculation that decides whether the file looks mostly like text.
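To make the byte range concrete, here is a minimal sketch that encodes the sample content from the question as ISO 8859-1 and counts the bytes in the extended range. This is not Tika's TextDetector code; the class and method names are made up for illustration, and it assumes the file holds exactly the five characters shown in the question.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: it only shows where the bytes land.
final class HighByteCount {

    /** Counts bytes whose unsigned value lies in the extended range 128..255. */
    static int countHighBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            // Java bytes are signed, so mask to get the unsigned value.
            if ((b & 0xFF) >= 128) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "\u00E8\u00E5" is "èå"; escapes avoid source-encoding problems.
        byte[] sample = "\u00E8\u00E5 \u00E8\u00E5".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sample.length + " bytes, "
                + countHighBytes(sample) + " of them in 128..255");
        // prints "5 bytes, 4 of them in 128..255"
    }
}
```

Four of the five bytes are in the extended range, so a check that only weighs 7-bit ASCII characters has almost nothing to go on for this file.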

I'm on the Tika development team, so I will take a look at the history of the original tickets and see how best to factor this in.

You can customise the detectors you are using in Tika by overriding the Tika configuration file, or the org.apache.tika.detect.Detector service loader.

For now, you may want to consider adding the FileCommandDetector to your configuration using one of these approaches, so that you can keep relying on the file command for detection.

For example, with the following configuration file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <detectors>
    <detector class="org.apache.tika.detect.FileCommandDetector"/>
    <detector class="org.apache.tika.detect.OverrideDetector"/>
    <detector class="org.apache.tika.mime.MimeTypes"/>
  </detectors>
</properties>

Running the detection below will correctly mark this file as text/plain:

java -jar tika-app-2.1.0.jar --config=tika-config.xml -d test.tmp

Whereas the default configuration will mark it as application/octet-stream:

java -jar tika-app-2.1.0.jar -d test.tmp

The FileCommandDetector executes the file command, if it is available on the local machine, to detect the type.

Tika sees these binary values in the file: 1110100011100101

These binary values could mean anything. They could just as well be an integer or long stored in this file, namely 59621.
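The ambiguity can be checked with a few lines of Java. This sketch reads the same two bytes (0xE8, 0xE5) once as an unsigned big-endian 16-bit number and once as ISO 8859-1 text; the class and method names are hypothetical, and the big-endian byte order is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: one byte pair, two equally valid readings.
final class TwoByteReading {

    /** Interprets two bytes as an unsigned big-endian 16-bit integer. */
    static int asUnsignedBigEndian(byte hi, byte lo) {
        return ((hi & 0xFF) << 8) | (lo & 0xFF);
    }

    /** Renders the same two bytes as a 16-character bit string. */
    static String asBits(byte hi, byte lo) {
        String bits = Integer.toBinaryString(asUnsignedBigEndian(hi, lo));
        return String.format("%16s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        byte hi = (byte) 0xE8;
        byte lo = (byte) 0xE5;
        System.out.println(asBits(hi, lo));              // prints 1110100011100101
        System.out.println(asUnsignedBigEndian(hi, lo)); // prints 59621
        // The very same bytes decoded as ISO 8859-1 text give "èå".
        String text = new String(new byte[] {hi, lo}, StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("\u00E8\u00E5")); // prints true
    }
}
```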

Such a short piece of content gives Tika too few values to make an educated guess, so it defaults to not recognizing the file: it is uncertain of its type, and the extension does not help push the content above the threshold of certainty.

This is why your longer file does work: the chance that it is a long, segmented run of integers masquerading as a text file is smaller.

When Tika fails because the file is too short, try a system call via shell_exec() as a fallback, to make a best guess at the file type.
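shell_exec() is a PHP function, so in the Java code from the question the closest equivalent is ProcessBuilder. Below is a minimal sketch of such a fallback. It assumes a Unix-like machine with the file command on the PATH and that Tika has returned application/octet-stream for the file; the class name, the method names and the tikaResult parameter are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical fallback helper; not part of Tika.
final class FileCommandFallback {

    // Accept only output shaped like "type/subtype", so that error
    // messages printed by the command are not mistaken for a MIME type.
    private static final Pattern MIME_SHAPE =
            Pattern.compile("[A-Za-z0-9.+-]+/[A-Za-z0-9.+-]+");

    /** Runs "file --brief --mime-type <path>"; empty if it cannot be used. */
    static Optional<String> mimeTypeFromFileCommand(Path path) {
        ProcessBuilder builder =
                new ProcessBuilder("file", "--brief", "--mime-type", path.toString());
        builder.redirectErrorStream(true);
        try {
            Process process = builder.start();
            String firstLine;
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    process.getInputStream(), StandardCharsets.UTF_8))) {
                firstLine = reader.readLine();
            }
            int exitCode = process.waitFor();
            if (exitCode != 0 || firstLine == null
                    || !MIME_SHAPE.matcher(firstLine.trim()).matches()) {
                return Optional.empty();
            }
            return Optional.of(firstLine.trim());
        } catch (IOException e) {
            // The command is not installed or could not be started.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    /** Keeps Tika's answer unless it is the generic octet-stream type. */
    static String detect(String tikaResult, Path path) {
        if (!"application/octet-stream".equals(tikaResult)) {
            return tikaResult;
        }
        return mimeTypeFromFileCommand(path).orElse(tikaResult);
    }

    public static void main(String[] args) {
        // Usage: java FileCommandFallback test.tmp
        System.out.println(detect("application/octet-stream", Paths.get(args[0])));
    }
}
```

Passing the arguments as a list rather than through a shell avoids quoting problems with unusual file names. If the command is missing or its output does not look like a MIME type, the sketch simply keeps Tika's original answer.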
