Java Tika cannot get embedded files from rar file

Using Tika's standard implementation, I parse a doc file that contains an image (image.png) and some text.

To extract embedded files, Tika uses the internal ParsingEmbeddedDocumentExtractor class and its parseEmbedded method.

First I create the necessary objects and call the parse method:

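// imports used by this snippet
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// other objects
AutoDetectParser parser = new AutoDetectParser();
ParseContext pc = new ParseContext();
Metadata metadata = new Metadata();
Tika tika = new Tika();
BodyContentHandler ch = new BodyContentHandler(-1); // -1 disables the write limit
InputStream is = new FileInputStream(new File("src/main/resources/sample.txt"));

// usage
parser.parse(is, ch, metadata, pc);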

Next, Tika selects the right parser, which in turn calls the parseEmbedded method of the ParsingEmbeddedDocumentExtractor class.

With this approach, if the input file is 1.docx, the output is:

1.docx
 -image.png
 -text

If the input file is 222.rar, the output is:

222.rar
  -1.docx
    -image.png
    -text

To control how the parts are extracted, I decided to replace the ParsingEmbeddedDocumentExtractor class with my own version that overrides its key method, parseEmbedded.

So I create a CustomParsingEmbeddedDocumentExtractor and register it in the ParseContext so that it takes the place of the original ParsingEmbeddedDocumentExtractor class:

CustomParsingEmbeddedDocumentExtractor customEmbeddedDocExtractor = new CustomParsingEmbeddedDocumentExtractor(pc);
pc.set(EmbeddedDocumentExtractor.class, customEmbeddedDocExtractor);   

For now, I do not change anything inside the new class or its parseEmbedded method.

When I test the new class on the same input data, I get the following output:

1) input file: 1.docx

output:
1.docx
 -image.png
 -text

2) input file: 222.rar

output:
222.rar
  -1.docx

Comparing the two classes side by side gives the following:

Result from ParsingEmbeddedDocumentExtractor:
input file: 222.rar

output:
222.rar
  -1.docx
    -image.png
    -text

Result from CustomParsingEmbeddedDocumentExtractor:
input file: 222.rar

output:
222.rar
  -1.docx

With the new class, I cannot get the objects I need (the image and the text) from 222.rar/1.docx, whereas the original ParsingEmbeddedDocumentExtractor class returns them.

It is not clear what causes this discrepancy, since I am only creating an identical class and registering it in place of the original one. I make no changes in my new class, yet the output is different.

Why can't Tika extract the data with this approach? Perhaps someone has already run into a similar situation. Thank you in advance.

For CustomParsingEmbeddedDocumentExtractor to work fully, two things must be registered in the ParseContext (pc):

  1. the custom extractor class itself

  2. the parser to use (in our case, the AutoDetectParser instance named parser)

Therefore, instead of this implementation:

CustomParsingEmbeddedDocumentExtractor customEmbeddedDocExtractor = new CustomParsingEmbeddedDocumentExtractor(pc);
pc.set(EmbeddedDocumentExtractor.class, customEmbeddedDocExtractor);   

we need this one:

// Only install the custom extractor if none has been registered yet
if (pc.get(EmbeddedDocumentExtractor.class) == null) {
    // Make sure a parser is registered in the context as well
    Parser p = pc.get(Parser.class);
    if (p == null) {
        pc.set(Parser.class, parser);
    }

    CustomParsingEmbeddedDocumentExtractor customEmbeddedDocExtractor = new CustomParsingEmbeddedDocumentExtractor(pc);
    pc.set(EmbeddedDocumentExtractor.class, customEmbeddedDocExtractor);
}
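
To see why the missing parser registration would produce exactly the output from the question, here is a small Tika-free model of the lookup. It rests on an assumption about Tika's internals that I have not verified against every version: the extractor asks the ParseContext for a Parser when it handles an embedded file, and falls back to a parser that does nothing when none is registered. Every name below (EmbeddedLookupModel, Context, Node, NodeParser) is hypothetical and exists only for this illustration; none of them are Tika classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model: none of these types are Tika classes.
class EmbeddedLookupModel {

    /** Stand-in for a type-keyed context such as ParseContext. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        /** Returns the registered value, or the fallback when nothing is registered. */
        <T> T get(Class<T> key, T fallback) {
            Object value = entries.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** A file together with the files embedded in it. */
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = List.of(children);
        }
    }

    /** Stand-in for a parser: reports the embedded files it finds in a node. */
    interface NodeParser {
        List<Node> embedded(Node node);
    }

    static final NodeParser REAL_PARSER = node -> node.children;
    static final NodeParser EMPTY_PARSER = node -> List.of();

    /** Parses the top-level file directly, then hands each embedded file to the extractor. */
    static List<String> extract(Node root, Context context) {
        List<String> seen = new ArrayList<>();
        seen.add(root.name);
        for (Node child : REAL_PARSER.embedded(root)) {
            parseEmbedded(child, context, 1, seen);
        }
        return seen;
    }

    /** Records the embedded file, then parses it with whatever parser the context holds. */
    private static void parseEmbedded(Node node, Context context, int depth, List<String> seen) {
        seen.add("  ".repeat(depth) + "-" + node.name);
        // Assumed behavior: no registered parser means an empty parser is used.
        NodeParser delegate = context.get(NodeParser.class, EMPTY_PARSER);
        for (Node child : delegate.embedded(node)) {
            parseEmbedded(child, context, depth + 1, seen);
        }
    }

    public static void main(String[] args) {
        Node rar = new Node("222.rar",
                new Node("1.docx", new Node("image.png"), new Node("text")));

        Context withoutParser = new Context();
        System.out.println(extract(rar, withoutParser));

        Context withParser = new Context();
        withParser.set(NodeParser.class, REAL_PARSER);
        System.out.println(extract(rar, withParser));
    }
}
```

In this model, the context without a registered parser lists only 222.rar and 1.docx, because 1.docx is handed to the empty fallback. Once a parser is registered, the image and the text inside 1.docx are listed as well. A top-level 1.docx still shows its direct children in both cases, since the top-level file is parsed by the parser that was called directly.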
