Invalid block type while using pdfbox 2.0.8
I'm trying to use PDFBox to get text from a PDF but I'm running into an exception. Here's the code I'm using to do the text stripping.
try
{
PDDocument document = PDDocument.load(decodedPdfDocument.getBytes());
PDFTextStripper Tstripper = new PDFTextStripper();
String st = Tstripper.getText(document);
System.out.println("Text:" + st);
}
catch (Exception e)
{
e.printStackTrace();
}
And here's the exception I'm getting.
java.io.IOException: java.util.zip.DataFormatException: invalid block type
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:167)
at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:155)
at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:91)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:485)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
at com.cerner.fsi.testing.xds.provideandregister.ClinicalEventInErrorTest.testProvideAndRegisterCDAWrappedDocument_ClinicalEventInError(ClinicalEventInErrorTest.java:117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.springframework.test.context.junit4.statements.RunBeforeTestMethodCallbacks.evaluate(RunBeforeTestMethodCallbacks.java:74)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.springframework.test.context.junit4.statements.RunAfterTestMethodCallbacks.evaluate(RunAfterTestMethodCallbacks.java:83)
at org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:72)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:231)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:88)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:71)
at com.cerner.fsi.testing.xds.EnterpriseClientSpringJUnit4ClassRunner$1$1.run(EnterpriseClientSpringJUnit4ClassRunner.java:50)
at com.cerner.fsi.testing.xds.EnterpriseClientSpringJUnit4ClassRunner$1$1.run(EnterpriseClientSpringJUnit4ClassRunner.java:1)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at com.cerner.fsi.testing.xds.EnterpriseClientSpringJUnit4ClassRunner$1.evaluate(EnterpriseClientSpringJUnit4ClassRunner.java:43)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:174)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
Caused by: java.util.zip.DataFormatException: invalid block type
at java.util.zip.Inflater.inflateBytes(Native Method)
at java.util.zip.Inflater.inflate(Inflater.java:259)
at java.util.zip.Inflater.inflate(Inflater.java:280)
at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:108)
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
... 49 more
I've run the PDF through validation and it's a valid 1.4 PDF. I'm also able to extract text using both Foxit and Acrobat frontend tools. I'm not quite sure what I'm doing wrong here; this is my first time using this API.
According to a comment by the OP:
The decodedPdfDocument comes from decoding a Base64 string. I'm creating a new string using new String(Base64.decode(documentString), "UTF-8");
This is the error. Never ever treat binary contents (like PDFs) as text! Instead, let the decoded PDF document remain a byte[]:
documentString = [...your base64 encoded PDF ...];
byte[] decodedPdfDocument = Base64.decode(documentString);
...
PDDocument document = PDDocument.load(decodedPdfDocument);
PDFTextStripper textStripper = new PDFTextStripper();
String st = textStripper.getText(document);
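The same idea can be sketched with only the JDK. The answer does not say which library its Base64 class comes from, so java.util.Base64 is an assumption here, and the names PdfBytesCheck, decodePdf and looksLikePdf are hypothetical. The sketch decodes straight to a byte[] and checks for the "%PDF-" file header before the bytes would be handed to PDDocument.load. The header is plain ASCII and survives a String round trip, which is probably why the document in the question still loaded and only failed once a compressed stream was decoded, so this check confirms the Base64 decoding but does not detect that kind of corruption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class PdfBytesCheck {

    // Decode Base64 directly to bytes; no String is created from the binary data.
    // Assumption: the input is plain Base64 without line breaks
    // (Base64.getMimeDecoder() tolerates line-wrapped input).
    static byte[] decodePdf(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    // Returns true if the data starts with the PDF header "%PDF-".
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for the real Base64 payload, which is not shown in the question.
        String documentString = Base64.getEncoder()
                .encodeToString("%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII));

        byte[] decodedPdfDocument = decodePdf(documentString);
        System.out.println("Looks like a PDF: " + looksLikePdf(decodedPdfDocument));
        // decodedPdfDocument would then be passed to PDDocument.load(...) as a byte[].
    }
}
```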
As background: when you do something like
String string = new String(bytes, encoding);
for a byte[] bytes, your bytes are decoded on the assumption that they contain text that was encoded to bytes using encoding.
If your bytes are not text encoded using encoding, this process can be destructive: byte sequences that don't make sense according to the assumed encoding are converted to a replacement character in the String. When you later retrieve bytes from the String using getBytes(), you get representations of the replacement character instead of the original byte sequences.
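A minimal sketch of that effect, using only the JDK. The class name RoundTripDemo, the helper viaString and the sample bytes are made up for illustration. The first two sample bytes are a common zlib stream header of the kind found in Flate-compressed PDF streams, but any bytes that are not valid UTF-8 would show the same behavior.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {

    // Pushes bytes through a UTF-8 String and back, as the code in the question does.
    static byte[] viaString(byte[] original) {
        String asText = new String(original, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Plain ASCII text is valid UTF-8 and survives the round trip unchanged.
        byte[] text = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println("text intact:   " + Arrays.equals(text, viaString(text)));

        // Hypothetical binary data: 0x9C and 0xFF are not valid UTF-8 on their own,
        // so decoding substitutes the replacement character U+FFFD for them.
        byte[] binary = {0x78, (byte) 0x9C, (byte) 0xFF, 0x00};
        byte[] mangled = viaString(binary);
        System.out.println("binary intact: " + Arrays.equals(binary, mangled));
        System.out.println("before: " + Arrays.toString(binary));
        System.out.println("after:  " + Arrays.toString(mangled));
    }
}
```

Once the replacement character has been substituted, the original byte values are gone, so no later call to getBytes() can recover them, whichever charset is used.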