iTextSharp GetTextFromPage does not return

This pertains to iTextSharp 5.5.8 or 5.5.9. My test harness is:

{
  PdfReader pdfReader = null;
  StringBuilder actual = new StringBuilder();

  try
  {
    pdfReader = new PdfReader(@"Quotation for Macbook 6-16.pdf");
  }
  catch (iTextSharp.text.exceptions.BadPasswordException bpe)
  {
    actual.AppendLine(string.Format("Exception: Bad Password {0}", bpe));
  }
  catch (Exception ex)
  {
    actual.AppendLine(string.Format("Exception: PDFReader {0}", ex));
  }

  int pages = pdfReader.NumberOfPages;
  for (int page = 1; page <= pages; page++)
  {
    try
    {
      String s = PdfTextExtractor.GetTextFromPage(pdfReader, page);
      actual.AppendLine(string.Format("{0}", s));
    }
    catch (Exception ex)
    {
      actual.AppendLine(string.Format("Exception PDF Page {0}: {1}", page, ex));
    }
  }

  foreach (var field in pdfReader.AcroFields.Fields)
  {
    actual.AppendLine(string.Format("{0}: {1}", field.Key, pdfReader.AcroFields.GetField(field.Key)));
  }
}

I have processed thousands of PDF files with GetTextFromPage, but I have hit one particular PDF for which the call never returns. I downloaded the source from GitHub and stepped through it while it processed the file. It looks like the values passed to LineDashPattern when it calls InitFirst cause an infinite loop. Here is the code from LineDashPattern.cs:

        private void InitFirst(float phase) {
            if (dashArray.Size > 0) {
                while (phase > 0) {
                    phase -= dashArray.GetAsNumber(currentIndex).FloatValue;
                    currentIndex = (currentIndex + 1) % DashArray.Size;
                    elemOrdinalNumber++;
                }

                if (phase < 0) {
                    --elemOrdinalNumber;
                    --currentIndex;
                    currentElem = new DashArrayElem(-phase, IsEven(elemOrdinalNumber));
                } else {
                    currentElem = new DashArrayElem(dashArray.GetAsNumber(currentIndex).FloatValue,
                        IsEven(elemOrdinalNumber));
                }
            }
        }

The phase passed in is 6.44245E+8, and the dashArray has two entries, 28.8 and 9.6. With a phase that large, the first while loop gets stuck: at float's resolution, subtracting 28.8 (or 9.6) is not significant enough to change the phase, so it never decreases.
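The stall can be reproduced without iTextSharp. Between 2^29 and 2^30, adjacent float values are 64 apart, so subtracting anything smaller than 32 rounds straight back to the starting value. The following is a minimal sketch of that effect; the class and method names are made up for illustration and are not part of iTextSharp.

```csharp
using System;

public static class FloatResolutionDemo
{
    // Performs one step of the loop's subtraction in single precision.
    // The explicit cast forces the result to be rounded to float.
    public static float SubtractOnce(float phase, float step)
    {
        return (float)(phase - step);
    }

    public static void Main()
    {
        float phase = 6.44245E+8f; // the phase reported in the question

        // Both dash array entries are smaller than half the spacing between
        // adjacent floats at this magnitude, so the phase does not change.
        Console.WriteLine(SubtractOnce(phase, 28.8f) == phase);
        Console.WriteLine(SubtractOnce(phase, 9.6f) == phase);

        // For comparison, a small phase is reduced as expected.
        Console.WriteLine(SubtractOnce(100f, 28.8f) == 100f);
    }
}
```

The first two comparisons come out true and the third false, which is why `while (phase > 0)` can never finish for this file.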

I do not know enough about the internals, or I would consider making changes myself.

I am really only interested in extracting the text, so a setting that lets me skip the line processing would work for me too.

I updated the LineDashPattern.cs file. I am using iTextSharp, and as far as I know 5.5.9 is the latest release, so iText 7 might be Java.

Anyhow, here is the code that I updated. I added elts (the sum of the line pattern elements) as a private field in the class, updated the DashArray property setter to recompute elts from the current dashArray, and finally updated the InitFirst method to divide the phase by elts. That does the bulk of the computation in one statement, then falls into the original code to find the actual element.
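The idea can be sketched in isolation, outside the iTextSharp class. This is an illustrative version only: it uses a plain float[] instead of PdfArray, the names DashPhaseDemo and Locate are hypothetical, and it does the bulk step in double precision, which is an assumption on my part rather than something the library does.

```csharp
using System;

public static class DashPhaseDemo
{
    // Returns the index of the dash array element that the phase lands in.
    // 'remaining' receives the length still left in that element.
    public static int Locate(float[] dashArray, float phase, out float remaining)
    {
        // Total length of one repetition of the pattern.
        double patternLength = 0.0;
        foreach (float length in dashArray)
        {
            patternLength += length;
        }
        if (patternLength <= 0.0)
        {
            remaining = 0f; // degenerate pattern: nothing to walk through
            return 0;
        }

        // Bulk step: drop all whole repetitions at once.
        double rest = phase - Math.Floor(phase / patternLength) * patternLength;

        // Walk the final, partial repetition element by element.
        int index = 0;
        while (rest >= dashArray[index])
        {
            rest -= dashArray[index];
            index = (index + 1) % dashArray.Length;
        }

        remaining = (float)(dashArray[index] - rest);
        return index;
    }

    public static void Main()
    {
        float remaining;
        int index = Locate(new float[] { 28.8f, 9.6f }, 6.44245E+8f, out remaining);
        Console.WriteLine("index={0}, remaining={1}", index, remaining);
    }
}
```

After the bulk step the remainder is at most about one pattern length, so the element-by-element loop runs only a handful of times regardless of how large the phase was.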

I do not know what phase values are typically passed into the routine, but with my value, had the loop been able to reduce the phase at all, it would have run through the pattern nearly 17 million times (6.44245E+8 / 38.4 is about 16.8 million). So this change should be significantly faster, and since the method is called multiple times for this PDF, the performance improvement is even greater, not to mention that it addresses the bug. The full file code is below:

/*
 * $Id$
 *
 * This file is part of the iText (R) project.
 * Copyright (c) 1998-2016 iText Group NV
 * Authors: Bruno Lowagie, Paulo Soares, et al.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License version 3
 * as published by the Free Software Foundation with the addition of the
 * following permission added to Section 15 as permitted in Section 7(a):
 * FOR ANY PART OF THE COVERED WORK IN WHICH THE COPYRIGHT IS OWNED BY
 * ITEXT GROUP. ITEXT GROUP DISCLAIMS THE WARRANTY OF NON INFRINGEMENT
 * OF THIRD PARTY RIGHTS
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 * or FITNESS FOR A PARTICULAR PURPOSE.
 * See the GNU Affero General Public License for more details.
 * You should have received a copy of the GNU Affero General Public License
 * along with this program; if not, see http://www.gnu.org/licenses or write to
 * the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
 * Boston, MA, 02110-1301 USA, or download the license from the following URL:
 * http://itextpdf.com/terms-of-use/
 *
 * The interactive user interfaces in modified source and object code versions
 * of this program must display Appropriate Legal Notices, as required under
 * Section 5 of the GNU Affero General Public License.
 *
 * In accordance with Section 7(b) of the GNU Affero General Public License,
 * a covered work must retain the producer line in every PDF that is created
 * or manipulated using iText.
 *
 * You can be released from the requirements of the license by purchasing
 * a commercial license. Buying such a license is mandatory as soon as you
 * develop commercial activities involving the iText software without
 * disclosing the source code of your own applications.
 * These activities include: offering paid services to customers as an ASP,
 * serving PDFs on the fly in a web application, shipping iText with a closed
 * source product.
 *
 * For more information, please contact iText Software Corp. at this
 * address: sales@itextpdf.com
 */

using System.util;
using iTextSharp.awt.geom;

namespace iTextSharp.text.pdf.parser {

    /**
     * Represents the line dash pattern. The line dash pattern shall control the pattern
     * of dashes and gaps used to stroke paths. It shall be specified by a dash array and
     * a dash phase.
     *
     * @since 5.5.6
     */
    public class LineDashPattern {

        private PdfArray dashArray;
        private float dashPhase;

        private int currentIndex;
        private int elemOrdinalNumber = 1;
        private DashArrayElem currentElem;
        private float elts = 0.0F;

        /**
         * Creates new {@link LineDashPattern} object.
         * @param dashArray The dash array. See {@link #getDashArray()}
         * @param dashPhase The dash phase. See {@link #getDashPhase()}
         */
        public LineDashPattern(PdfArray dashArray, float dashPhase) {
            // Assign through the property so that elts is computed as well.
            DashArray = new PdfArray(dashArray);
            this.dashPhase = dashPhase;
            InitFirst(dashPhase);
        }

        /**
         * Getter and setter for the dash array.
         *
         * The dash array's elements are numbers that specify the lengths of
         * alternating dashes and gaps; the numbers are nonnegative. The
         * elements are expressed in user space units.
         *
         * @return The dash array.
         */
        public PdfArray DashArray {
            get { return dashArray; }
            set {
                dashArray = value;
                // Cache the total length of one repetition of the pattern for InitFirst.
                elts = 0.0F;
                for (int i = 0; i < dashArray.Size; i++) {
                    elts += dashArray.GetAsNumber(i).FloatValue;
                }
            }
        }

        /**
         * Getter and setter for the dash phase.
         *
         * The dash phase shall specify the distance into the dash pattern at which
         * to start the dash. The elements are expressed in user space units.
         *
         * @return The dash phase.
         */
        public float DashPhase {
            get { return dashPhase; }
            set { dashPhase = value; }
        }

        /**
         * Calculates and returns the next element which is either gap or dash.
         * @return The next dash array's element.
         */
        public DashArrayElem Next() {
            DashArrayElem ret = currentElem;

            if (dashArray.Size > 0) {
                currentIndex = (currentIndex + 1) % DashArray.Size;
                currentElem = new DashArrayElem(dashArray.GetAsNumber(currentIndex).FloatValue,
                    IsEven(++elemOrdinalNumber));
            }

            return ret;
        }

        /**
         * Checks whether the dashed pattern is solid or not. It's solid when the
         * size of a dash array is even and sum of all the units off in the array
         * is 0.<br/>
         * For example: [3 0 4 0 5 0 6 0] (sum is 0), [3 0 4 0 5 1] (sum is 1).
         */
        public bool IsSolid() {
            if (dashArray.Size % 2 != 0) {
                return false;
            }

            float unitsOffSum = 0;

            for (int i = 1; i < dashArray.Size; i += 2) {
                unitsOffSum += dashArray.GetAsNumber(i).FloatValue;
            }

            return Util.Compare(unitsOffSum, 0) == 0;
        }

        /**
         * Resets the dash array so that the {@link #next()} method will start
         * from the beginning of the dash array.
         */
        public void Reset() {
            currentIndex = 0;
            elemOrdinalNumber = 1;
            InitFirst(dashPhase);
        }

        private void InitFirst(float phase) {
            if (dashArray.Size > 0) {
                if (elts > 0.0F) {
                    // Handle the bulk of the line pattern: skip all whole repetitions
                    // in one step. Double precision avoids the rounding error of a
                    // float quotient this large.
                    int occurrences = (int)((double)phase / elts);
                    elemOrdinalNumber += occurrences * dashArray.Size;
                    phase = (float)(phase - (double)occurrences * elts);

                    // Adjust for the final, partial set of pattern elements.
                    while (phase > 0) {
                        phase -= dashArray.GetAsNumber(currentIndex).FloatValue;
                        currentIndex = (currentIndex + 1) % DashArray.Size;
                        elemOrdinalNumber++;
                    }
                }

                if (phase < 0) {
                    --elemOrdinalNumber;
                    --currentIndex;
                    currentElem = new DashArrayElem(-phase, IsEven(elemOrdinalNumber));
                } else {
                    currentElem = new DashArrayElem(dashArray.GetAsNumber(currentIndex).FloatValue,
                        IsEven(elemOrdinalNumber));
                }
            }
        }

        private bool IsEven(int num) {
            return (num % 2) == 0;
        }

        public class DashArrayElem {

            private float val;
            private bool isGap;

            public DashArrayElem(float val, bool isGap) {
                this.val = val;
                this.isGap = isGap;
            }

            public float Value
            {
                get { return val; }
                set { val = value; }
            }

            public bool IsGap
            {
                get { return isGap; }
                set { isGap = value; }
            }
        }
    }
}
