Apache Tika unable to detect Content-Type using file content

Question

I have been trying to detect the MIME type from the file content alone, using the Apache Tika Core and Apache Tika Parser 1.23 jars. This is the code I am using:

import java.io.File;
import org.apache.tika.Tika;

Tika tika = new Tika();
File file = new File(filepath);
String mimeType = tika.detect(file); // may throw IOException

Tika fails to detect the content type of a file with a .tmp extension (a text/plain file) in the ISO-8859-1 charset, with the content below:

èå

A file with the same extension and charset, but with the content below, is detected properly:

00000000000000000000 00000000000000000000 00000000000000000000 00000000000000000000 00000000000000000000 00000000000000000000 PREPREPREPREPREPREPREPREPREêîððåñïîíäåíòñêèå

I have tried the Linux file command to detect the MIME type, and it works as expected. I also tried the Apache Tika App 2.1.0 GUI, but it behaves the same as my code.

Any suggestions on how to detect such files from their content? Thank you in advance.

Answer 1

In the ISO 8859-1 encoding these characters fall in the 8-bit range, i.e. the extended range 128 to 255. However, while the TextDetector counts these bytes, it does not take them into account when calculating whether the file looks mostly like text.
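To illustrate the kind of heuristic involved, here is a minimal sketch. It is not Tika's actual code: the class name, the method name, and the 2% / 90% thresholds are assumptions chosen for illustration. It shows how a detector that only credits 7-bit ASCII bytes as text will reject a file consisting of nothing but bytes in the 128 to 255 range.

```java
class TextHeuristicSketch {
    // Assumed thresholds for illustration only; Tika's real values may differ.
    static final int MAX_CONTROL_PERCENT = 2;
    static final int MIN_ASCII_PERCENT = 90;

    /** Returns true if the bytes look like mostly-ASCII text under the assumed thresholds. */
    static boolean looksMostlyAscii(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        int control = 0;
        int ascii = 0;
        int high = 0;
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v == '\t' || v == '\n' || v == '\r') {
                ascii++;      // common whitespace is treated as text
            } else if (v < 0x20 || v == 0x7F) {
                control++;    // other control characters suggest binary data
            } else if (v < 0x80) {
                ascii++;      // printable 7-bit ASCII
            } else {
                high++;       // 128..255: counted, but never credited as text
            }
        }
        return control * 100 < data.length * MAX_CONTROL_PERCENT
                && ascii * 100 > data.length * MIN_ASCII_PERCENT;
    }

    public static void main(String[] args) {
        // The two bytes of the short file from the question in ISO 8859-1.
        byte[] shortFile = {(byte) 0xE8, (byte) 0xE5};
        byte[] asciiFile = "plain ASCII text\n".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        System.out.println(looksMostlyAscii(shortFile)); // prints false
        System.out.println(looksMostlyAscii(asciiFile)); // prints true
    }
}
```

Under these assumed thresholds the two-byte file has an ASCII share of 0%, so it can never pass, regardless of how valid it is as ISO 8859-1 text.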

I'm on the Tika development team, so I will take a look at the history of the original tickets and see how best to factor this in.

You can customise the detectors Tika uses by overriding the Tika configuration file or the org.apache.tika.detect.Detector service loader.

For now, you may want to add the FileCommandDetector to your configuration using one of these approaches, so that detection falls back on the file command that already works for you.

For example, with the following configuration file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <detectors>
    <detector class="org.apache.tika.detect.FileCommandDetector"/>
    <detector class="org.apache.tika.detect.OverrideDetector"/>
    <detector class="org.apache.tika.mime.MimeTypes"/>
  </detectors>
</properties>

Running detection with this configuration correctly marks the file as text/plain:

java -jar tika-app-2.1.0.jar --config=tika-config.xml -d test.tmp

Whereas the default configuration marks it as application/octet-stream:

java -jar tika-app-2.1.0.jar -d test.tmp

The FileCommandDetector executes the file command, if it is available on the local machine, to detect the type.

Answer 2

Tika sees these binary values in the file: 1110100011100101

These binary values could mean anything. They could just as well be an integer or long stored in this file, namely 59621.
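The arithmetic can be checked with a short sketch. The class and method names are made up for illustration. It takes the two bytes that è and å occupy in ISO 8859-1 (0xE8 and 0xE5) and shows both the bit pattern and the value obtained when the same bytes are read as an unsigned big-endian integer.

```java
class ByteInterpretation {
    /** Renders each byte as 8 binary digits, concatenated. */
    static String toBinary(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            String bits = Integer.toBinaryString(b & 0xFF);
            for (int i = bits.length(); i < 8; i++) {
                sb.append('0'); // left-pad to a full byte
            }
            sb.append(bits);
        }
        return sb.toString();
    }

    /** Reads up to four bytes as an unsigned big-endian integer. */
    static long toUnsignedBigEndian(byte[] data) {
        long value = 0;
        for (byte b : data) {
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] content = {(byte) 0xE8, (byte) 0xE5}; // "èå" in ISO 8859-1
        System.out.println(toBinary(content));            // prints 1110100011100101
        System.out.println(toUnsignedBigEndian(content)); // prints 59621
    }
}
```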

With so little content, Tika has too few values to make an educated guess, so it defaults to not recognizing the file because it is uncertain of its type. The extension does not help either, since .tmp does not push the guess above the threshold of certainty.

This is why your longer file does work: the chance that it is a long run of integers masquerading as a text file is much smaller.

When Tika fails because the file is too short, try a system call as a backup to make a best guess of the file type (shell_exec() in PHP; in Java, ProcessBuilder serves the same purpose).
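Since the question uses Java, such a fallback could look roughly like the sketch below. It is only an outline under stated assumptions: the class and method names are hypothetical, it assumes a file command that accepts --brief and --mime-type is on the PATH, and it treats application/octet-stream from Tika as "not detected".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

class FileCommandFallback {
    static final String GENERIC = "application/octet-stream";
    // Accepts only output shaped like "type/subtype", so error messages are rejected.
    private static final Pattern MIME = Pattern.compile("[\\w.+-]+/[\\w.+-]+");

    /** True when the primary detector (e.g. Tika) gave no useful answer. */
    static boolean needsFallback(String primaryResult) {
        return primaryResult == null || GENERIC.equals(primaryResult);
    }

    /** Asks the local file command for the MIME type; returns GENERIC on any failure. */
    static String detectWithFileCommand(String path) {
        try {
            Process process = new ProcessBuilder("file", "--brief", "--mime-type", path)
                    .redirectErrorStream(true)
                    .start();
            String line;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                line = reader.readLine(); // the command prints a single line
            }
            if (!process.waitFor(5, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // do not hang on a stuck process
                return GENERIC;
            }
            if (line != null && MIME.matcher(line.trim()).matches()) {
                return line.trim();
            }
            return GENERIC;
        } catch (IOException e) {
            return GENERIC; // file command missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return GENERIC;
        }
    }
}
```

A caller would run Tika first and invoke detectWithFileCommand(filepath) only when needsFallback(mimeType) is true, so the external process is started only for the files Tika cannot classify.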

solution1
2 2021-10-05 22:16:40

solution2
0 2021-10-05 10:03:17
