3rd party Pdf library significantly slower when running NUnit

I am evaluating Winnovative's PdfToText library and have run into something that concerns me.

Everything runs fine, and I can extract the text content from a small pdf (20k or less) immediately when running a console application. However, if I call the same code from the NUnit GUI runner, it takes 15-25 seconds (I've verified it's PdfToText by putting a breakpoint on the line that extracts the text and hitting F10 to see how long it takes to advance to the next line).
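Stepping with F10 only gives a rough feel for the delay. A minimal sketch of a more repeatable measurement, assuming the extraction call can be wrapped in a delegate, is shown below. `ExtractionTimer` and `Measure` are hypothetical names introduced here for illustration; they are not part of PdfToText or NUnit.

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer
{
    // Hypothetical helper: runs the supplied action once and
    // returns the wall-clock time it took.
    public static TimeSpan Measure(Action extract)
    {
        Stopwatch watch = Stopwatch.StartNew();
        extract();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Wrapping the same extraction line with `ExtractionTimer.Measure(() => ...)` in both the console application and the NUnit test, and printing the result, would give two numbers that can be compared directly.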

This concerns me because I don't know the cause, so I'm not sure where to lay blame. Is the problem with NUnit or with PdfToText? All I want to do is extract the text from a pdf, and 20 seconds is completely unreasonable if I'm going to see this behavior under certain conditions. If it only happens when running NUnit, that's acceptable; otherwise I'll have to look elsewhere.

It's easier to demonstrate the problem with a complete VS 2010 solution, so here's a link that makes it easy to set up and run (no need to download NUnit, PdfToText, or even a sample pdf): http://dl.dropbox.com/u/273037/PdfToTextProblem.zip (you may have to change the reference to PdfToText to use the x86 dll if you're running on a 32-bit machine).

Just hit F5 and the NUnit GUI runner will load.

I'm not tied to this library, so I'm open to suggestions. I've tried iTextSharp (way too expensive for 2 lines of code) and looked at Aspose (I didn't try it, but the SaaS license is $11k). They either lack the required functionality or are way too expensive.

(comment turned into answer)

How complex are your PDFs? The 4.1.6 version of iText allows for a closed-source solution. Although 4.1.6 doesn't directly have a text extractor, it isn't too terribly hard to write one using PdfReader and GetPageContent().

Below is the code I used to extract the text from the PDF using iTextSharp v4.1.6. If it seems overly verbose, that's because of how I'm using it and the flexibility required.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf;

namespace ClassLibrary1
{
    public class PdfToken
    {
        private PdfToken(int type, string value)
        {
            Type = type;
            Value = value;
        }

        public static PdfToken Create(PRTokeniser tokenizer)
        {
            return new PdfToken(tokenizer.TokenType, tokenizer.StringValue);
        }

        public int Type { get; private set; }
        public string Value { get; private set; }
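        // True for TK_OTHER tokens, which in a content stream are the operators
        // (TJ, Tm, Tf, ...) that terminate an operation, despite the property name.
        public bool IsOperand
        {
            get
            {
                return Type == PRTokeniser.TK_OTHER;
            }
        }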
    }

    public class PdfOperation
    {
        public PdfOperation(PdfToken operationToken, IEnumerable<PdfToken> arguments)
        {
            Name = operationToken.Value;
            Arguments = arguments;
        }

        public string Name { get; private set; }
        public IEnumerable<PdfToken> Arguments { get; private set; }
    }

    public interface IPdfParsingStrategy
    {
        void Execute(PdfOperation op);
    }

    public class PlainTextParsingStrategy : IPdfParsingStrategy
    {
        StringBuilder text = new StringBuilder();

        public PlainTextParsingStrategy()
        {

        }

        public String GetText()
        {
            return text.ToString();
        }

        #region IPdfParsingStrategy Members

        public void Execute(PdfOperation op)
        {
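            // Only the operators this extractor cares about are handled;
            // see the Adobe PDF specs for additional operations (e.g. Tj is not handled here)
            switch (op.Name)
            {
                case "TJ": // show text, with individual glyph positioning
                    PrintText(op);
                    break;
                case "Tm": // set text matrix
                    SetMatrix(op);
                    break;
                case "Tf": // set font and size
                    SetFont(op);
                    break;
                case "S":  // stroke path; treated here as a section separator
                    PrintSection(op);
                    break;
                case "G":  // set gray level for stroking
                case "g":  // set gray level for non-stroking
                case "rg": // set RGB color for non-stroking
                    SetColor(op);
                    break;
            }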
        }

        #endregion

        bool newSection = false;

        private void PrintSection(PdfOperation op)
        {
            text.AppendLine("------------------------------------------------------------");
            newSection = true;
        }

        private void PrintNewline(PdfOperation op)
        {
            text.AppendLine();
        }

        private void PrintText(PdfOperation op)
        {
            if (newSection)
            {
                newSection = false;
                StringBuilder header = new StringBuilder();
                PrintText(op, header);
            }

            PrintText(op, text);
        }

        private static void PrintText(PdfOperation op, StringBuilder text)
        {
            foreach (PdfToken t in op.Arguments)
            {
                switch (t.Type)
                {
                    case PRTokeniser.TK_STRING:
                        text.Append(t.Value);
                        break;
                    case PRTokeniser.TK_NUMBER:
                        text.Append(" ");
                        break;
                }
            }
        }

        String lastFont = String.Empty;
        String lastFontSize = String.Empty;

        private void SetFont(PdfOperation op)
        {
            var args = op.Arguments.ToList();
            string font = args[0].Value;
            string size = args[1].Value;

            //if (font != lastFont || size != lastFontSize)
            //    text.AppendLine();

            lastFont = font;
            lastFontSize = size;
        }

        String lastX = String.Empty;
        String lastY = String.Empty;

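        private void SetMatrix(PdfOperation op)
        {
            // Tm takes six operands (a b c d e f); e and f are the x and y translation
            var args = op.Arguments.ToList();
            string x = args[4].Value;
            string y = args[5].Value;

            // A change in y is treated as a new line, a change in x alone as a gap
            if (lastY != y)
                text.AppendLine();
            else if (lastX != x)
                text.Append(" ");

            lastX = x;
            lastY = y;
        }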

        String lastColor = String.Empty;

        private void SetColor(PdfOperation op)
        {
            lastColor = PrintCommand(op).Replace(" ", "_");
        }

        private static string PrintCommand(PdfOperation op)
        {
            StringBuilder text = new StringBuilder();
            foreach (PdfToken t in op.Arguments)
                text.AppendFormat("{0} ", t.Value);
            text.Append(op.Name);
            return text.ToString();
        }

    }
}

And here's how I call it:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using iTextSharp.text.pdf;

namespace ClassLibrary1
{
    public class PdfExtractor
    {
        public static string GetText(byte[] pdfBuffer)
        {
            PlainTextParsingStrategy strategy = new PlainTextParsingStrategy();
            ParsePdf(pdfBuffer, strategy);
            return strategy.GetText();
        }

        private static void ParsePdf(byte[] pdf, IPdfParsingStrategy strategy)
        {
            PdfReader reader = new PdfReader(pdf);

            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                byte[] page = reader.GetPageContent(i);
                if (page != null)
                {
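                    PRTokeniser tokenizer = new PRTokeniser(page);
                    // Operands seen since the last operator
                    List<PdfToken> parameters = new List<PdfToken>();

                    // Content streams are postfix: operands come first, then the operator
                    while (tokenizer.NextToken())
                    {
                        var token = PdfToken.Create(tokenizer);
                        if (token.IsOperand)
                        {
                            // Pass a copy so the operation keeps its arguments after Clear()
                            strategy.Execute(new PdfOperation(token, parameters.ToArray()));
                            parameters.Clear();
                        }
                        else
                        {
                            parameters.Add(token);
                        }
                    }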
                }
            }

        }
    }
}
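
The parsing loop works because PDF content streams are postfix: operands accumulate until an operator arrives, and the operator then consumes them. A toy sketch of just that grouping step, with no iTextSharp dependency, might look like this. `PostfixGroupingSketch` and `Group` are hypothetical names, and the tokens are assumed to be already split into strings rather than produced by a real tokenizer.

```csharp
using System;
using System.Collections.Generic;

public static class PostfixGroupingSketch
{
    // Hypothetical helper: groups a flat token list into (operator, operands) pairs.
    // Any token found in 'operators' is treated as an operator; everything else is an operand.
    public static List<KeyValuePair<string, string[]>> Group(
        IEnumerable<string> tokens, ISet<string> operators)
    {
        var result = new List<KeyValuePair<string, string[]>>();
        var pending = new List<string>();

        foreach (string token in tokens)
        {
            if (operators.Contains(token))
            {
                // Copy the operands so clearing the buffer does not affect stored results.
                result.Add(new KeyValuePair<string, string[]>(token, pending.ToArray()));
                pending.Clear();
            }
            else
            {
                pending.Add(token);
            }
        }

        // Operands left over without a trailing operator are dropped.
        return result;
    }
}
```

For example, grouping the tokens `/F1 12 Tf 1 0 0 1 72 700 Tm (Hello) Tj` with `Tf`, `Tm`, and `Tj` as the operator set yields three operations: `Tf` with two operands, `Tm` with six, and `Tj` with one.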
