
C# iText7 text coordinate extraction question

I am working on a PDF text extractor with iText7 and am noticing strange text coordinates on a certain PDF. Most documents appear to yield x and y coordinates within the height and width of the page, but one seems to yield negatives. I was wondering if there is a standard approach to dealing with negative coordinates here. The basic approach is to take positive inch measurements from a PDF and map them to the text and coordinates extracted by iText7, using a scale of 1/72 inch per point.

I am deriving from LocationTextExtractionStrategy, and the code is as follows:

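        using System.Collections.Generic;
        using iText.Kernel.Pdf.Canvas.Parser;
        using iText.Kernel.Pdf.Canvas.Parser.Data;
        using iText.Kernel.Pdf.Canvas.Parser.Listener;

        // Collects one TextRect per non-whitespace character on the page.
        private class LocationTextListStrategy : LocationTextExtractionStrategy
        {
            private readonly List<TextRect> _textRects = new List<TextRect>();

            public List<TextRect> TextRects() => _textRects;

            public override void EventOccurred(IEventData data, EventType type)
            {
                // Only text rendering events carry glyph positions.
                if (type != EventType.RENDER_TEXT)
                    return;

                var renderInfo = (TextRenderInfo)data;
                var text = renderInfo.GetCharacterRenderInfos();

                foreach (var t in text)
                {
                    if (string.IsNullOrWhiteSpace(t.GetText()))
                        continue;

                    AddTextRect(t);
                }
            }

            private void AddTextRect(TextRenderInfo t)
            {
                // Left/bottom come from the start of the baseline, right/top from
                // the end of the ascent line. All values are in the page's user
                // space units, not relative to any page box.
                var letterStart = t.GetBaseline().GetStartPoint();
                var letterEnd = t.GetAscentLine().GetEndPoint();

                var newTextRect = new TextRect(
                    text: t.GetText(),
                    l: letterStart.Get(0),
                    r: letterEnd.Get(0),
                    t: letterEnd.Get(1),
                    b: letterStart.Get(1));

                _textRects.Add(newTextRect);
            }
        }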

Each PDF page can have its own, custom coordinate system. It is common to have the origin in the lower left corner of the page, but it is not required.

| Key | Type | Value |
|---|---|---|
| MediaBox | rectangle | (Required; inheritable) A rectangle (see 7.9.5, "Rectangles"), expressed in default user space units, that shall define the boundaries of the physical medium on which the page shall be displayed or printed (see 14.11.2, "Page boundaries"). |
| CropBox | rectangle | (Optional; inheritable) A rectangle, expressed in default user space units, that shall define the visible region of default user space. When the page is displayed or printed, its contents shall be clipped (cropped) to this rectangle (see 14.11.2, "Page boundaries"). Default value: the value of MediaBox. |

(ISO 32000-2:2017, Table 31: Entries in a page)

Thus, always interpret coordinates with respect to the crop box of the page they refer to. A negative x or y is perfectly valid whenever the lower left corner of that box itself lies at negative coordinates.

The iText 7 class PdfPage has matching getters (GetMediaBox and GetCropBox in the .NET version).
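To illustrate the point about the crop box, here is a minimal sketch of how extracted coordinates could be shifted so they are measured from a corner of the crop box and then converted to inches. The class CropBoxCoordinates and its methods are hypothetical names introduced for this example, not part of iText; the crop box values are assumed to come from something like page.GetCropBox().GetLeft(), GetBottom() and GetTop(). The sketch also assumes the page is not rotated.

```csharp
using System;

// Hypothetical helper (not part of iText): converts a point given in default
// user space units (1/72 inch each) into inches measured from the crop box.
public static class CropBoxCoordinates
{
    private const float PointsPerInch = 72f;

    // cropLeft and cropBottom are assumed to be the crop box's lower left
    // corner, e.g. taken from page.GetCropBox() in iText 7.
    public static (float X, float Y) ToInchesFromLowerLeft(
        float x, float y, float cropLeft, float cropBottom)
    {
        // Subtracting the box origin removes any negative offset.
        return ((x - cropLeft) / PointsPerInch, (y - cropBottom) / PointsPerInch);
    }

    // Variant for measurements taken from the top left corner,
    // with y growing downward.
    public static (float X, float Y) ToInchesFromTopLeft(
        float x, float y, float cropLeft, float cropTop)
    {
        return ((x - cropLeft) / PointsPerInch, (cropTop - y) / PointsPerInch);
    }
}
```

For example, on a letter-sized page whose crop box is [-306 -396 306 396], a glyph reported at (-234, 324) lies 1 inch from the left edge and 10 inches above the bottom edge, or equivalently 1 inch below the top edge. Which corner to measure from depends on how the inch measurements were taken in the first place.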


 