Java: Apache Tika: unexpected RuntimeException when extracting text from a .doc file. The file opens without any error in MS Word

I have used my TikaParser class, built on Apache Tika, to extract plain text from '.doc' files:

public static void main(String[] args) throws Exception {
    ContentHandler handler = new ToHTMLContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    FileInputStream content = new FileInputStream("file.doc");
    parser.parse(content, handler, metadata, context);
    System.out.println(handler.toString());

    String[] metadataNames = metadata.names();
    for (String name : metadataNames) {
        System.out.println(name + " : " + metadata.get(name));
    }

    FileOutputStream outStream = new FileOutputStream("file.doc.txt");
    outStream.write(handler.toString().getBytes());
    outStream.close();
    content.close();
}

This works for most files, but for one specific file it throws the following exception:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@7c417213
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.goarya.app.resumestorage.migration.TikaParser.main(TikaParser.java:29)
Caused by: java.lang.IllegalArgumentException: The end (7161) must not be before the start (7162)
at org.apache.poi.hwpf.usermodel.Range.sanityCheckStartEnd(Range.java:208)
at org.apache.poi.hwpf.usermodel.Range.<init>(Range.java:194)
at org.apache.poi.hwpf.usermodel.Paragraph.<init>(Paragraph.java:165)
at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph(Paragraph.java:144)
at org.apache.poi.hwpf.usermodel.Range.getParagraph(Range.java:766)
at org.apache.poi.hwpf.extractor.WordExtractor.getParagraphText(WordExtractor.java:168)
at org.apache.poi.hwpf.extractor.WordExtractor.getMainTextboxText(WordExtractor.java:145)
at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:183)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:169)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:130)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 3 more
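
The root cause in the trace is a bounds check: a paragraph range was built whose end offset (7161) is one less than its start offset (7162), so the failure comes from the file's internal offsets rather than from the calling code. A minimal sketch of that kind of check, assuming a hypothetical `checkRange` helper (an illustration only, not the POI source), produces the same message:

```java
public class RangeCheckSketch {

    // Hypothetical helper: illustrates the kind of validation named in the trace.
    // It rejects any range whose end offset lies before its start offset.
    static void checkRange(int start, int end) {
        if (end < start) {
            throw new IllegalArgumentException(
                    "The end (" + end + ") must not be before the start (" + start + ")");
        }
    }

    public static void main(String[] args) {
        checkRange(0, 10); // a well-formed range passes silently

        try {
            checkRange(7162, 7161); // the offsets reported in the trace
        } catch (IllegalArgumentException e) {
            // prints: The end (7161) must not be before the start (7162)
            System.out.println(e.getMessage());
        }
    }
}
```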

The .doc file opens in Microsoft Word without any error.

Also, reading it from C# with Microsoft.Office.Interop.Word returns the plain text.

How do I overcome this issue using Apache Tika?

Edit: adding a sample doc for this scenario

I am using the Tika core 1.2 jar, and my program runs successfully with the following code.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.SAXException;


public class Exmple2 {
    public static void main(final String[] args) throws IOException, TikaException, SAXException {

        ToHTMLContentHandler handler = new ToHTMLContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        FileInputStream content = new FileInputStream("/home/ist/FTRDocuments/taableDis.docx");
        parser.parse(content, handler, metadata, context);
        System.out.println(handler.toString());

        String[] metadataNames = metadata.names();
        for (String name : metadataNames) {
            System.out.println(name + " : " + metadata.get(name));
        }

        FileOutputStream outStream = new FileOutputStream("/home/ist/file.doc.txt");
        outStream.write(handler.toString().getBytes());
        outStream.close();
        content.close();
    }
}

The only difference from your code is that, with Tika 1.2, I declare the handler as ToHTMLContentHandler where you declare it as ContentHandler.
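
If a particular file still fails, one option is to isolate the failure per file. The trace shows Tika wrapping the parser's RuntimeException in a TikaException, so the caller can catch it and try another strategy instead of aborting. The sketch below uses only the JDK and assumes a hypothetical `TextExtractor` interface and `extractWithFallback` helper (neither is part of Tika). In real use the first extractor would wrap the `parser.parse(...)` call, and the second could be whatever alternative is available:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class FallbackExtractSketch {

    /** Hypothetical abstraction over one extraction strategy (for example, a Tika-based one). */
    interface TextExtractor {
        String extract(String path) throws Exception;
    }

    /** Tries each extractor in order and returns the first result; empty if every extractor fails. */
    static Optional<String> extractWithFallback(String path, List<TextExtractor> extractors) {
        for (TextExtractor extractor : extractors) {
            try {
                return Optional.of(extractor.extract(path));
            } catch (Exception e) {
                // Log and move on to the next strategy instead of stopping the whole run.
                System.err.println("Extractor failed for " + path + ": " + e);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for a parser that rejects this particular file.
        TextExtractor failing = path -> {
            throw new IllegalArgumentException("The end (7161) must not be before the start (7162)");
        };
        // Stand-in for an alternative extraction strategy.
        TextExtractor backup = path -> "text from " + path;

        Optional<String> text = extractWithFallback("file.doc", Arrays.asList(failing, backup));
        // prints: text from file.doc
        System.out.println(text.orElse("<no text>"));
    }
}
```

This does not repair the problematic file; it only keeps one bad document from stopping the rest of the extraction.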
