
Split huge 40000-page PDF into single pages, iTextSharp, OutOfMemoryException

I am getting huge PDF files with lots of data. The current PDF is 350 MB and has about 40000 pages. It would of course have been nice to get smaller PDFs, but this is what I have to work with now :-(

I can open it in Acrobat Reader with some delay when loading, but after that Acrobat Reader is quick.

Now I need to split the huge file into single pages, then try to read some recipient data from the PDF pages, and then send the one or two pages that each recipient should get to each particular recipient.

Here is my very small code so far, using iTextSharp:

var inFileName = @"huge350MB40000pages.pdf";
PdfReader reader = new PdfReader(inFileName);
var nbrPages = reader.NumberOfPages;
reader.Close();

What happens is that execution reaches the second line, "new PdfReader", stays there for perhaps 10 minutes while the process grows to about 1.7 GB, and then I get an OutOfMemoryException.

I think "new PdfReader" attempts to read the entire PDF into memory.

Is there some other/better way to do this? For example, can I somehow read only a part of a PDF file into memory instead of all of it at once? Could it work better using some other library than iTextSharp?

From what I have read, it looks like you should instantiate the PdfReader with the constructor that takes a RandomAccessFileOrArray object. Disclaimer: I have not tried this out myself.

iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(new iTextSharp.text.pdf.RandomAccessFileOrArray(@"C:\PDFFile.pdf"), null);

This is a total shot in the dark, and I haven't tested this code. It's an extract from the 'iText in Action' book, given there as an example of how to deal with large PDF files. The code is in Java but should be fairly easy to convert.

This is the method that loads everything into memory:

PdfReader reader;
long before;
before = getMemoryUse();
reader = new PdfReader(
"HelloWorldToRead.pdf", null);
System.out.println("Memory used by the full read: "
+ (getMemoryUse() - before));

This is the memory-saving way, where the document should be loaded bit by bit as required:

before = getMemoryUse();
reader = new PdfReader(
new RandomAccessFileOrArray("HelloWorldToRead.pdf"), null);
System.out.println("Memory used by the partial read: "
+ (getMemoryUse() - before));

You might be able to use Ghostscript directly: http://svn.ghostscript.com/ghostscript/tags/ghostscript-9.02/doc/Use.htm#One_page_per_file
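The linked section describes writing each page to its own file by putting a %d template in the output file name. Here is a minimal Java sketch that builds and launches such a command with the pdfwrite device. The class and method names are hypothetical, the executable name ("gs", or something like "gswin64c" on Windows) is an assumption about the local install, and this is untested on a file as large as the one in the question.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Ghostscript or iText.
class GhostscriptSplit {

    // Matches a page-number template such as %d or %05d.
    private static final Pattern PAGE_TEMPLATE = Pattern.compile("%\\d*d");

    /**
     * Builds the argument list for:
     *   gs -sDEVICE=pdfwrite -dSAFER -o <outputPattern> <inputPdf>
     * The output pattern must contain a %d-style template so that
     * each page goes to its own file.
     */
    static List<String> command(String gsExecutable, String inputPdf, String outputPattern) {
        if (!PAGE_TEMPLATE.matcher(outputPattern).find()) {
            throw new IllegalArgumentException(
                    "outputPattern needs a %d template, e.g. page-%05d.pdf");
        }
        return Arrays.asList(
                gsExecutable, "-sDEVICE=pdfwrite", "-dSAFER", "-o", outputPattern, inputPdf);
    }

    /** Runs the command and returns the process exit code (0 normally means success). */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

Usage would be along the lines of GhostscriptSplit.run(GhostscriptSplit.command("gs", "huge350MB40000pages.pdf", "page-%05d.pdf")). Because the work happens in a separate process, the memory limit of the .NET or Java process no longer applies, although Ghostscript's own memory use and speed on 40000 pages would still need to be measured.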

For reading the recipient data, pdftextstream might be a good choice.
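Once a recipient key has been read from each page (with whatever text extractor ends up being used), the "one or two pages per recipient" step is independent of the PDF library. Here is a rough Java sketch of that grouping. It assumes the pages of one recipient are consecutive and that the extraction step has already produced one non-null key per page; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups consecutive pages that share a recipient key.
class RecipientPageGrouper {

    /** A run of consecutive pages (1-based, inclusive) belonging to one recipient. */
    static final class PageRange {
        final String recipient;
        final int firstPage;
        final int lastPage;

        PageRange(String recipient, int firstPage, int lastPage) {
            this.recipient = recipient;
            this.firstPage = firstPage;
            this.lastPage = lastPage;
        }

        @Override
        public String toString() {
            return recipient + ":" + firstPage + "-" + lastPage;
        }
    }

    /**
     * recipientPerPage.get(i) is the key extracted from page i + 1.
     * Assumption: keys are non-null and a recipient's pages are adjacent.
     */
    static List<PageRange> group(List<String> recipientPerPage) {
        List<PageRange> ranges = new ArrayList<>();
        int start = 0; // 0-based index of the first page in the current run
        for (int i = 1; i <= recipientPerPage.size(); i++) {
            boolean endOfRun = i == recipientPerPage.size()
                    || !recipientPerPage.get(i).equals(recipientPerPage.get(start));
            if (endOfRun) {
                // Convert to 1-based page numbers; i is already the 1-based last page.
                ranges.add(new PageRange(recipientPerPage.get(start), start + 1, i));
                start = i;
            }
        }
        return ranges;
    }
}
```

Each resulting range could then be written out as one small PDF per recipient, rather than 40000 single-page files that have to be matched up again afterwards.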

PDF Toolkit is quite useful for these types of tasks. I haven't tried it with such a huge file yet, though.

Could it work better using some other library than itextsharp?

Please try Aspose.Pdf for .NET, which allows you to split the PDF into single pages or into different sets of pages in various ways, using either files or memory streams. The API is very simple to learn and use. It works with large PDF files that have a large number of pages.

Disclosure: I work as developer evangelist at Aspose.
