Extracting text from a PDF file

Question

I am using PDFBox in a C# .NET project. When the following code block runs, I get a TypeInitializationException ("The type initializer for 'java.lang.Throwable' threw an exception"):

  FileStream stream = new FileStream(@"C:\1.pdf", FileMode.Open);

  // Retrieve the PDF bytes from the stream.
  byte[] pdfbytes = new byte[65000];

  stream.Read(pdfbytes, 0, 65000);

  // Get the PDF file bytes.
  allbytes = pdfbytes;

  // Create a stream from the file bytes.
  java.io.InputStream ins = new java.io.ByteArrayInputStream(allbytes);
  string txt;

  // Load the document.
  PDDocument doc = PDDocument.load(ins);
  PDFTextStripper stripper = new PDFTextStripper();

  // Retrieve the PDF document's text.
  txt = stripper.getText(doc);
  doc.close();

The exception is thrown by the following statement:

PDDocument doc = PDDocument.load(ins);

How can I fix this?

Here is the stack trace:

   at java.lang.Throwable.__<map>(Exception , Boolean )
   at org.pdfbox.pdfparser.PDFParser.parse()
   at org.pdfbox.pdmodel.PDDocument.load(InputStream input, RandomAccess scratchFile)
   at org.pdfbox.pdmodel.PDDocument.load(InputStream input)
   at At.At.ExtractTextFromPDF(InputStream fileStream) in C:\Users\Administrator\Documents\Visual Studio 2008\Projects\AtProject\Att\At.cs:line 61

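The inner exception:

InnerException {"Could not load file or assembly 'IKVM.Runtime, Version=0.30.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58' or one of its dependencies. The system cannot find the file specified.":"IKVM.Runtime, Version=0.30.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58"} System.Exception {System.IO.FileNotFoundException}

OK, I solved the previous problem by copying some of PDFBox's .dll files into my bin folder. But now I get this error: expected='/' actual='.'-- 1 org.pdfbox.io.PushBackInputStream@283d742

Is there an alternative to PDFBox? Is there another reliable library I can use to extract text from PDF files?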

Answer 1

It looks like you are missing some of the libraries PDFBox depends on. You need:

IKVM.GNU.Classpath.dll
PDFBox-XXX.dll
FontBox-XXX-dev.dll
IKVM.Runtime.dll

See the topic "Read from a PDF file using C#". The comments on that topic discuss similar problems.
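The list of required assemblies above can be checked at startup, before any PDFBox type is touched, so that a missing file is reported by name instead of surfacing as a TypeInitializationException. The following is only a minimal sketch using standard .NET file APIs; `PdfBoxDependencyCheck` and `FindMissing` are hypothetical names introduced for illustration, and the exact file names to check depend on the PDFBox build you downloaded.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class PdfBoxDependencyCheck
{
    // Hypothetical helper: returns the entries of `required` that do not
    // exist as files inside `directory`.
    public static List<string> FindMissing(string directory, IEnumerable<string> required)
    {
        List<string> missing = new List<string>();
        foreach (string name in required)
        {
            if (!File.Exists(Path.Combine(directory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

A typical call would pass `AppDomain.CurrentDomain.BaseDirectory` (the bin folder at run time) together with the DLL names from the list above, and log every name that comes back.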

Answer 2

I found that the version of the DLL files was the culprit. Go to http://www.netlikon.de/docs/PDFBox-0.7.2/bin/?C=M;O=A and download the following files:

IKVM.Runtime.dll
IKVM.GNU.Classpath.dll
PDFBox-0.7.2.dll

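Then copy them into the root directory of your Visual Studio project. Right-click the project, choose Add Reference, locate all 3 files, and add them.

Finally, here is the code I used to parse the PDF into text:

C#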

// Needs: using System; using org.pdfbox.pdmodel; using org.pdfbox.util;
private static string TransformPdfToText(string SourceFile)
{
    string content = "";
    PDFTextStripper stripper = new PDFTextStripper();
    // Load the document directly from its path.
    PDDocument doc = PDDocument.load(SourceFile);

    try
    {
        content = stripper.getText(doc);
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
    finally
    {
        // Close the document exactly once, whether or not extraction succeeded.
        doc.close();
    }
    return content;
}

Visual Basic

Private Function parseUsingPDFBox(ByVal filename As String) As String
    LogFile(" Attempting to parse file: " & filename)
    Dim stripper As PDFTextStripper = New PDFTextStripper()
    ' Load the document directly from its path.
    Dim doc As PDDocument = PDDocument.load(filename)

    Dim content As String = "empty"
    Try
        content = stripper.getText(doc)
    Catch ex As Exception
        LogFile(" Error parsing file: " & filename & vbCrLf & ex.Message)
    Finally
        ' Close the document exactly once, whether or not extraction succeeded.
        doc.close()
    End Try
    Return content
End Function
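If you prefer the stream-based approach from the question over passing a file path, note that the question's code copies at most 65000 bytes with a single `Stream.Read` call, so a larger PDF would reach PDFBox truncated. Whether that explains the second error reported in the question is not confirmed anywhere in this thread. The sketch below only shows one way to read a stream to its end with standard .NET APIs; `StreamUtil` and `ReadFully` are hypothetical names.

```csharp
using System.IO;

static class StreamUtil
{
    // Hypothetical helper: reads `input` until Read reports end of stream
    // and returns everything that was read.
    public static byte[] ReadFully(Stream input)
    {
        byte[] buffer = new byte[8192];
        using (MemoryStream collected = new MemoryStream())
        {
            int read;
            // Stream.Read may return fewer bytes than requested, so loop
            // until it returns 0, which signals end of stream.
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                collected.Write(buffer, 0, read);
            }
            return collected.ToArray();
        }
    }
}
```

The resulting array can then be wrapped in `new java.io.ByteArrayInputStream(bytes)` and passed to `PDDocument.load`, as in the question's code.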

Answer 1: score 2, accepted, 2009-11-15 21:50:40

Answer 2: score 1, 2012-01-31 18:32:35