
Using itextsharp to split a pdf into smaller pdf's based on size

So we have some really inefficient code that splits a PDF into smaller chunks based on a maximum allowed size. That is, if the max size is 10 MB, an 8 MB file would be skipped, while a 16 MB file would be split based on the number of pages.

This is code that I inherited, and I feel there has to be a more efficient way to do it, one that requires only a single method and less object instantiation (a single-pass sketch of what I have in mind follows the two methods below).

We use the following code to call the methods:

        List<int> splitPoints = this.GetPDFSplitPoints(currentDocument, maxSize);
        List<byte[]> documents = this.SplitPDF(currentDocument, maxSize, splitPoints);

Methods:

    private List<int> GetPDFSplitPoints(IClaimDocument currentDocument, int maxSize)
    {
        List<int> splitPoints = new List<int>();
        PdfReader reader = null;
        Document document = null;
        int pagesRemaining = currentDocument.Pages;

        while (pagesRemaining > 0)
        {
            reader = new PdfReader(currentDocument.Data);
            document = new Document(reader.GetPageSizeWithRotation(1));

            using (MemoryStream ms = new MemoryStream())
            {
                PdfCopy copy = new PdfCopy(document, ms);
                PdfImportedPage page = null;

                document.Open();

                //Add pages until we run out from the original
                for (int i = 0; i < currentDocument.Pages; i++)
                {
                    int currentPage = currentDocument.Pages - (pagesRemaining - 1);

                    if (pagesRemaining == 0)
                    {
                        //The whole document has been traversed
                        break;
                    }

                    page = copy.GetImportedPage(reader, currentPage);
                    copy.AddPage(page);

                    //If the current collection of pages exceeds the maximum size, we save off the index and start again
                    if (copy.CurrentDocumentSize > maxSize)
                    {
                        if (i == 0)
                        {
                            //One page is greater than the maximum size
                            throw new Exception("one page is greater than the maximum size and cannot be processed");
                        }

                        //We have gone one page too far, save this split index   
                        splitPoints.Add(currentDocument.Pages - (pagesRemaining - 1));
                        break;
                    }
                    else
                    {
                        pagesRemaining--;
                    }
                }

                page = null;

                document.Close();
                document.Dispose();
                copy.Close();
                copy.Dispose();
                copy = null;
            }
        }

        if (reader != null)
        {
            reader.Close();
            reader = null;
        }

        document = null;

        return splitPoints;
    }

    private List<byte[]> SplitPDF(IClaimDocument currentDocument, int maxSize, List<int> splitPoints)
    {
        var documents = new List<byte[]>();
        PdfReader reader = null;
        Document document = null;
        MemoryStream fs = null;
        int pagesRemaining = currentDocument.Pages;

        while (pagesRemaining > 0)
        {
            reader = new PdfReader(currentDocument.Data);
            document = new Document(reader.GetPageSizeWithRotation(1));

            fs = new MemoryStream();
            PdfCopy copy = new PdfCopy(document, fs);
            PdfImportedPage page = null;

            document.Open();

            //Add pages until we run out from the original
            for (int i = 0; i <= currentDocument.Pages; i++)
            {
                int currentPage = currentDocument.Pages - (pagesRemaining - 1);
                if (pagesRemaining == 0)
                {
                    //We have traversed all pages
                    //The call to copy.Close() MUST come before using fs.ToArray() because copy.Close() finalizes the document
                    fs.Flush();
                    copy.Close();
                    documents.Add(fs.ToArray());
                    document.Close();
                    fs.Dispose();
                    break;
                }

                page = copy.GetImportedPage(reader, currentPage);
                copy.AddPage(page);
                pagesRemaining--;

                if (splitPoints.Contains(currentPage + 1) == true)
                {
                    //Need to start a new document
                    //The call to copy.Close() MUST come before using fs.ToArray() because copy.Close() finalizes the document
                    fs.Flush();
                    copy.Close();
                    documents.Add(fs.ToArray());
                    document.Close();
                    fs.Dispose();
                    break;
                }
            }

            copy = null;
            page = null;

            fs.Dispose();
        }

        if (reader != null)
        {
            reader.Close();
            reader = null;
        }

        if (document != null)
        {
            document.Close();
            document.Dispose();
            document = null;
        }

        if (fs != null)
        {
            fs.Close();
            fs.Dispose();
            fs = null;
        }

        return documents;
    }
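
For reference, here is the kind of single-pass shape I have in mind. This is a sketch only, assuming iTextSharp 5.x and the same IClaimDocument as above: it opens the PdfReader once for the whole file and emits each chunk directly, instead of computing split points in one pass and re-reading the source for every chunk in a second. The 0.7 safety factor is my own assumption to compensate for CurrentDocumentSize under-reporting until Close() (see the note at the bottom about Close() adding resources); it would need tuning.

    private List<byte[]> SplitBySize(IClaimDocument currentDocument, int maxSize)
    {
        var chunks = new List<byte[]>();
        PdfReader reader = new PdfReader(currentDocument.Data);

        try
        {
            int totalPages = reader.NumberOfPages;
            int page = 1;

            //CurrentDocumentSize under-reports until Close() writes the
            //shared resources, so split against a reduced threshold.
            //0.7 is a guessed safety factor and would need tuning.
            long threshold = (long)(maxSize * 0.7);

            while (page <= totalPages)
            {
                using (MemoryStream ms = new MemoryStream())
                {
                    Document document = new Document(reader.GetPageSizeWithRotation(page));
                    PdfCopy copy = new PdfCopy(document, ms);
                    document.Open();

                    //Add pages until the threshold is crossed. The page that
                    //crosses it stays in the current chunk; the safety margin
                    //absorbs the overshoot. (An oversized single page becomes
                    //its own chunk here instead of throwing.)
                    do
                    {
                        copy.AddPage(copy.GetImportedPage(reader, page));
                        page++;
                    }
                    while (page <= totalPages && copy.CurrentDocumentSize < threshold);

                    //Close() finalizes the PDF into ms; ToArray() is still
                    //valid after the stream has been closed.
                    document.Close();
                    chunks.Add(ms.ToArray());
                }
            }
        }
        finally
        {
            reader.Close();
        }

        return chunks;
    }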

As far as I can tell, the only code I can find online is in VB, and it doesn't necessarily address the size issue.

UPDATE:

We're experiencing OutOfMemory exceptions, and I believe it's an issue with the Large Object Heap. So one thought was to reduce the code's footprint, which would possibly reduce the number of large objects on the heap.

Basically this is part of a loop that goes through any number of PDFs, splits them, and stores them in the database. Right now, we've had to change the method from doing all of them at once (the last run was 97 PDFs of various sizes) to running 5 PDFs through the system every 5 minutes. This is not ideal and won't scale well when we ramp up the tool to more clients.

(We're dealing with 50-100 MB PDFs, but they could be larger.)
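
One mitigation I'm considering (an assumption on my part, not something we've implemented): stage each chunk in a temp file instead of a MemoryStream, since a 50-100 MB byte[] or MemoryStream buffer goes straight onto the Large Object Heap, and then stream the file into the database. A minimal sketch, assuming the usual iTextSharp.text / iTextSharp.text.pdf / System.IO usings:

    private string WriteChunkToTempFile(PdfReader reader, int firstPage, int lastPage)
    {
        string path = Path.GetTempFileName();

        using (FileStream fs = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            Document document = new Document(reader.GetPageSizeWithRotation(firstPage));
            PdfCopy copy = new PdfCopy(document, fs);
            document.Open();

            for (int p = firstPage; p <= lastPage; p++)
            {
                copy.AddPage(copy.GetImportedPage(reader, p));
            }

            //Close() streams the finished PDF to disk, so the chunk never
            //has to be materialized as a byte[] in managed memory.
            document.Close();
        }

        return path;
    }

Deleting the file after the database insert keeps temp space bounded, and streaming it into the insert (rather than loading it back with File.ReadAllBytes) avoids recreating the large byte[].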

I also inherited this exact code, and there appears to be a major flaw in it. In the GetPDFSplitPoints method, it checks the total size of the copied pages against maxSize to determine at which page to split the file.
In the SplitPDF method, when it reaches the page where the split occurs, sure enough the MemoryStream at that point is below the maximum size allowed, and one more page would put it over the limit. But after document.Close(); is executed, much more is added to the MemoryStream (in one example PDF I worked with, the Length of the MemoryStream went from 9 MB before document.Close() to 19 MB after it). My understanding is that all the necessary resources for the copied pages are added upon Close().
I'm guessing I'll have to rewrite this code completely to ensure I don't exceed the max size while retaining the integrity of the original pages.
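
A sketch of what I mean, assuming iTextSharp 5.x: measure the real size of a candidate chunk by letting Close() write the shared resources, then binary-search the largest page range that fits. MeasureChunkSize and FindSplitEnd are hypothetical helper names, not part of iTextSharp; re-copying ranges repeatedly is not free, but the binary search keeps it to O(log n) trial copies per chunk.

    private static long MeasureChunkSize(PdfReader reader, int firstPage, int lastPage)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            Document document = new Document(reader.GetPageSizeWithRotation(firstPage));
            PdfCopy copy = new PdfCopy(document, ms);
            document.Open();

            for (int p = firstPage; p <= lastPage; p++)
            {
                copy.AddPage(copy.GetImportedPage(reader, p));
            }

            //The shared resources are written here, so ms.Length is the
            //size the chunk will really have once finalized.
            document.Close();
            return ms.Length;
        }
    }

    private static int FindSplitEnd(PdfReader reader, int firstPage, long maxSize)
    {
        int lo = firstPage;
        int hi = reader.NumberOfPages;
        int best = firstPage;

        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;

            if (MeasureChunkSize(reader, firstPage, mid) <= maxSize)
            {
                best = mid;     //fits; try a longer chunk
                lo = mid + 1;
            }
            else
            {
                hi = mid - 1;   //too big; shrink the range
            }
        }

        //If best == firstPage, the caller still has to verify that the
        //single page fits on its own; an oversized page needs the same
        //guard as the original "one page is greater than the maximum
        //size" exception.
        return best;
    }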
