Extracting text with the PDF Clown 'TextExtractor'

Question

I get an error when using the TextExtractor from the PDF Clown library. The code I am using is

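TextExtractor textExtractor = new TextExtractor(true, true);
for(final Page page : file.getDocument().getPages())
{
  System.out.println("\nScanning page " + (page.getIndex()+1) + "...\n");

  // Extract the page text, grouped by the text area it belongs to.
  Map<Rectangle2D,List<ITextString>> textStrings = textExtractor.extract(page);
  // ... (the rest of the loop body was not included in the post)
}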

Part of the error I get is

Exception in thread "main" java.lang.ExceptionInInitializerError
at org.pdfclown.document.contents.fonts.encoding.put
at ......
at ......
<about 30 such lines>
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader
<about 30 lines more>

I have also found that this happens when my PDF contains bullet points, for example

Item 1
Item 2
Item 3

Please help me extract the text from PDFs like this.

Answer 1

(The following comment turned out to be the solution:)

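Using your highlighter.java class (provided on your Google Drive, as mentioned in the comments) together with the current PDF Clown trunk version as a jar, the PDF is processed without incident, and in particular without a NullPointerException (though some of the highlights are positioned incorrectly).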

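After looking at the contents of the shared Google Drive, however, I assume that you did not use the PDF Clown jar, but merely compiled the classes from the source folder of the distribution and used those.

The PDF Clown jar file contains additional resources beyond the compiled classes, and your setup therefore does not include them. Thus:

Your highlighter.java has to be used with pdfclown.jar on the classpath.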
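
The stack trace in the question fits this explanation: a NullPointerException in java.io.Reader.<init>, reached through java.io.InputStreamReader, is what you get when a reader is constructed over a stream that is null, and a class-loader resource lookup yields null when the resource is not on the classpath. The sketch below reproduces that mechanism with plain JDK calls only. It is not PDF Clown code; the class name MissingResourceDemo and the resource path are hypothetical and chosen purely for illustration. The last helper shows one way to check whether a class came from a jar or from a folder of compiled classes.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.security.CodeSource;

public class MissingResourceDemo {

    // Hypothetical resource path, for illustration only;
    // this is NOT the name of an actual PDF Clown resource.
    static final String HYPOTHETICAL_RESOURCE = "/fonts/hypothetical-encoding-table.txt";

    /** Returns true if the named resource can be found on the classpath. */
    static boolean resourceExists(String path) {
        return MissingResourceDemo.class.getResource(path) != null;
    }

    /** Returns true if wrapping a null stream in a reader throws NullPointerException. */
    static boolean readerRejectsNullStream() {
        // getResourceAsStream returns null for a resource that is not on the classpath.
        InputStream missing = MissingResourceDemo.class.getResourceAsStream(HYPOTHETICAL_RESOURCE);
        try {
            new InputStreamReader(missing);
            return false;
        } catch (NullPointerException e) {
            // Thrown from the java.io.Reader constructor, as in the question's trace.
            return true;
        }
    }

    /** Describes where a class was loaded from: a jar file or a directory of .class files. */
    static String loadedFrom(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "unknown (no code source)" : String.valueOf(source.getLocation());
    }

    public static void main(String[] args) {
        System.out.println("resource found: " + resourceExists(HYPOTHETICAL_RESOURCE));
        System.out.println("null stream rejected: " + readerRejectsNullStream());
        System.out.println("loaded from: " + loadedFrom(MissingResourceDemo.class));
    }
}
```

In your own project, passing TextExtractor.class to a helper like loadedFrom should show a location ending in pdfclown.jar once the jar is on the classpath; a directory path would suggest that the classes were compiled from source and the bundled resources are missing.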

0 · Accepted · 2013-05-20 09:19:10
