
Extract text from a PDF file using Apache Tika in Java

try {
    File file = new File("Example.pdf");
    String content = new Tika().parseToString(file);
    System.out.println("The Content: " + content);
} catch (Exception e) {
    e.printStackTrace();
}

I have imported java.io.File and org.apache.tika.Tika, but when I run this code I get the following error:

Exception in thread "main" java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;Ljava/lang/Throwable;)V
    at org.apache.commons.logging.impl.SLF4JLocationAwareLog.warn(SLF4JLocationAwareLog.java:162)
    at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.loadDiskCache(FileSystemFontProvider.java:461)
    at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.<init>(FileSystemFontProvider.java:217)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl$DefaultFontProvider.<clinit>(FontMapperImpl.java:130)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getProvider(FontMapperImpl.java:149)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:413)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:376)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:350)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:146)
    at org.apache.pdfbox.pdmod...
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.Tika.parseToString(Tika.java:527)
    at org.apache.tika.Tika.parseToString(Tika.java:642)
    at java_programs.PdfParse.main(PdfParse.java:22)

The following seems to work for me. I get the string I want, although a few warnings are also printed to the console.

[On Windows] I compiled and ran it with:

javac -cp .;tika-app-1.16.jar Test.java

java -cp .;tika-app-1.16.jar Test

Which Tika jar are you using? I added another method (tikaPdfTest()) to show a different way of getting the text from a PDF, which may help you.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.Tika;

import org.xml.sax.SAXException;

public class Test {
    public static void main(final String[] args) {
        //Your way
        try {
            File file = new File("Example.pdf");
            String content = new Tika().parseToString(file);
            System.out.println("The Content: " + content);
        } catch (final Exception e) {
            e.printStackTrace();
        }

        //Another way
        try {
            System.out.println("The contents:\t[" + Test.tikaPdfTest("Example.pdf") + "]");
        } catch (final Exception e) {
            e.printStackTrace();
        }
    }

    public static String tikaPdfTest(final String fileName) throws IOException, SAXException, TikaException {
        try(final FileInputStream inputstream = new FileInputStream(new File(fileName))) {
            final BodyContentHandler handler = new BodyContentHandler();
            new PDFParser().parse(inputstream, handler, new Metadata(), new ParseContext());
            return handler.toString().trim();
        }
    }
}
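
A NoSuchMethodError like the one in the question usually means that a class compiled against one version of a library is running against a different version, so it can help to check which jar each class involved is actually loaded from. The following is only a minimal diagnostic sketch: the class name JarLocator and the method locate are hypothetical names chosen for illustration, and the class names it checks are taken from the stack trace in the question.

```java
import java.security.CodeSource;

public class JarLocator {
    // Returns the location a class was loaded from, or a short note if it cannot be determined.
    static String locate(final String className) {
        try {
            // initialize=false: only look the class up, do not run its static initializers
            final Class<?> type = Class.forName(className, false, JarLocator.class.getClassLoader());
            final CodeSource source = type.getProtectionDomain().getCodeSource();
            return source == null ? "no code source available" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not on classpath";
        }
    }

    public static void main(final String[] args) {
        // Class names taken from the stack trace in the question
        final String[] names = {
            "org.apache.tika.Tika",
            "org.slf4j.spi.LocationAwareLogger",
            "org.apache.commons.logging.impl.SLF4JLocationAwareLog"
        };
        for (final String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Run it with the same classpath as the failing program. If the slf4j and commons-logging classes resolve to different jars than the Tika classes, an older or newer logging jar elsewhere on the classpath is a likely suspect.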
