
Extracting text using the PDF Clown TextExtractor

I am getting an error while using the TextExtractor of the PDF Clown library. The code I used is:

TextExtractor textExtractor = new TextExtractor(true, true);
for(final Page page : file.getDocument().getPages())
{
  System.out.println("\nScanning page " + (page.getIndex()+1) + "...\n");

  // Extract the page text (the exception is thrown during this call)
  Map textStrings = textExtractor.extract(page);
}

Part of the error I got is:

Exception in thread "main" java.lang.ExceptionInInitializerError
at org.pdfclown.documents.contents.fonts.Encoding.put
at ......
at ......
<about 30 such lines>
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader
<about 30 lines more>

I also found that this happens when my PDF contains bullets, for example:

  • item 1
  • item 2
  • item 3

Please help me extract the text from such PDFs.

(The following comment turned out to be the solution:)

Using your highlighter.java class (provided on your Google Drive in a comment) together with the current PDF Clown trunk version as a jar, the PDF was processed without incident, and in particular without a NullPointerException (although some of the highlights were not at the right position).

After looking at your shared Google Drive contents, though, I assumed you did not use a PDF Clown jar but instead merely compiled the classes from the distribution source folder and used them.

The PDF Clown jar files contain additional resources, though, which your setup consequently did not include. Thus:

Your highlighter.java has to be used with pdfclown.jar in the classpath.
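
To illustrate why missing resources would end in a NullPointerException inside java.io.Reader, here is a minimal sketch in plain Java. It does not use PDF Clown at all; the class name ClasspathResourceCheck, its two helper methods, and the resource name "no/such/resource.txt" are hypothetical and only serve the illustration. The assumption is that the library loads a bundled resource from the classpath and wraps the stream in an InputStreamReader without checking for null, which is what the stack trace in the question suggests.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;

// Hypothetical helper class, not part of PDF Clown.
class ClasspathResourceCheck {

    /** Returns true if the named resource can be found on the classpath. */
    static boolean isOnClasspath(String resourceName) {
        return ClassLoader.getSystemClassLoader().getResource(resourceName) != null;
    }

    /**
     * Reproduces the failure mode: getResourceAsStream returns null for a
     * missing resource, and the InputStreamReader constructor rejects null
     * with a NullPointerException raised in the Reader constructor.
     */
    static boolean readerFailsOnMissingResource(String resourceName) {
        InputStream in = ClassLoader.getSystemClassLoader().getResourceAsStream(resourceName);
        try {
            new InputStreamReader(in).close();
            return false; // resource was found and the reader opened normally
        } catch (NullPointerException e) {
            return true;  // resource missing: same exception type as in the question
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass the resource name to check as the first argument;
        // the default below is a placeholder that should not exist.
        String name = args.length > 0 ? args[0] : "no/such/resource.txt";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
        System.out.println("reader fails: " + readerFailsOnMissingResource(name));
    }
}
```

When such a failure happens inside a static initializer, the JVM reports it as an ExceptionInInitializerError with the NullPointerException as its cause, which matches the trace above. Putting pdfclown.jar on the classpath (for example with the -cp option of both javac and java) makes the bundled resources visible to the class loader.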
