Tika AutoDetectParser returning empty string?

I'm attempting to use Tika's AutoDetectParser to extract a file's content. I originally thought this was a dependency issue, but I cannot see how that could still be true now that I'm including all of tika-app in my jar.

AutoDetectParser returns an empty string here:

BodyContentHandler handler = new BodyContentHandler();  
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
FileInputStream mypdfstream = new FileInputStream(new File("mypdf.pdf"));
parser.parse(mypdfstream,handler,metadata,context);
System.out.println(handler.toString());

Further confusing me is the fact that using a standard PDFParser works fine:

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
FileInputStream mypdfstream = new FileInputStream(new File("mypdf.pdf"));
PDFParser pdfparser = new PDFParser();
pdfparser.parse(mypdfstream,handler,metadata,context);
System.out.println(handler.toString());

I have included both the tika-app and tika-parsers jars on my classpath and bundled them into the jar created by Ant.

Relevant portions of build.xml:

<javac srcdir="${src}" destdir="${build}">
        <classpath>
                <pathelement path="lib/tika-app-1.11.jar"/>
                <pathelement path="lib/tika-parsers-1.11.jar"/>
        </classpath>
</javac>

<jar jarfile="${dist}/lib/MyProject-${DSTAMP}.jar" basedir="${build}">
        <zipgroupfileset dir="lib" includes="tika-app-1.11.jar"/>
        <zipgroupfileset dir="lib" includes="tika-parsers-1.11.jar"/>
</jar>

Edit: I looked at my list of supported types with parser.getSupportedTypes(context) and it was empty, as was the list of parsers returned from parser.getParsers().

So perhaps this is yet another dependency issue? That truly surprises me, given that tika-app is included.

I had the same issue. I fixed it by adding the Tika Core and Parsers dependencies to my pom.xml again, like this, and then updating the Maven project in Eclipse.

    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.18</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.18</version>
    </dependency>
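
If the dependencies are present and the parser list is still empty, it may be worth checking whether the parser registration file is visible at runtime. As far as I know, AutoDetectParser's default configuration discovers parsers through service files named META-INF/services/org.apache.tika.parser.Parser, so an empty getParsers() result would be consistent with that file being missing from the runtime classpath or from a merged jar. The sketch below uses only the JDK; the class name ServiceFileCheck and the method findResources are hypothetical names made up for this example, and whether the Ant merge actually loses the file is an assumption to verify, not a confirmed cause.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of Tika): lists every copy of a resource on the classpath.
public class ServiceFileCheck {

    // Service file that Tika's parser discovery is assumed to read.
    public static final String TIKA_PARSER_SERVICE =
            "META-INF/services/org.apache.tika.parser.Parser";

    // Returns the URL of each copy of resourceName visible to the given class loader.
    public static List<URL> findResources(ClassLoader loader, String resourceName) {
        try {
            return Collections.list(loader.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> hits = findResources(ClassLoader.getSystemClassLoader(), TIKA_PARSER_SERVICE);
        if (hits.isEmpty()) {
            System.out.println("No parser service file is visible on the classpath");
        } else {
            for (URL url : hits) {
                System.out.println("Found: " + url);
            }
        }
    }
}
```

Running this with the same classpath (or the same merged jar) as the failing program shows whether any service file is found and which jar it comes from. If none is listed, the problem is more likely in how the jar is assembled or launched than in the parsing code itself.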
