c# itext7/itextsharp: how to find coordinates of a particular term in a PDF file?

I am using iText 7 / iTextSharp in C#.

How can I find the coordinates of multiple occurrences of a particular word?

Finding the position of search terms (more generally, search expressions) is the job of the iText 7 RegexBasedLocationExtractionStrategy. It allows you to search for all matches of a regular expression on a page and returns these matches, including the exact matched text and its location on the page:

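using System;
using System.Globalization;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// SEARCH_EXPRESSION is the regular expression (a string) to search for.
PdfDocument pdfDocument = ...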

for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
{
    Console.WriteLine("Page {0}", page);
    RegexBasedLocationExtractionStrategy strategy = new RegexBasedLocationExtractionStrategy(SEARCH_EXPRESSION);
    new PdfCanvasProcessor(strategy).ProcessPageContent(pdfDocument.GetPage(page));
    foreach (IPdfTextLocation location in strategy.GetResultantLocations())
    {
        if (location != null)
        {
            Rectangle rect = location.GetRectangle();
            Console.WriteLine(String.Format(CultureInfo.InvariantCulture, " - '{0}' at ({1}, {2}), {3}\u00d7{4}", location.GetText(), rect.GetX(), rect.GetY(), rect.GetWidth(), rect.GetHeight()));
        }
    }
}
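The question asks about a particular word rather than a general expression. A minimal sketch of how a literal term could be turned into a search expression follows. The class `SearchExpressionBuilder` and its method `ForLiteralTerm` are hypothetical helpers, not part of iText, and the sketch assumes the strategy interprets the expression with .NET regular expression syntax.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not part of iText): builds a search expression
// that matches a term literally instead of treating it as a pattern.
public static class SearchExpressionBuilder
{
    // Regex.Escape neutralizes metacharacters such as '.', '+' or '(',
    // so "a.b" only matches the text "a.b" and not, say, "a+b".
    // With wholeWord set, \b anchors are added on both sides; this only
    // behaves as expected when the term starts and ends with a word
    // character (letter, digit or underscore).
    public static string ForLiteralTerm(string term, bool wholeWord = false)
    {
        string escaped = Regex.Escape(term);
        return wholeWord ? @"\b" + escaped + @"\b" : escaped;
    }
}
```

The result would then be passed in place of SEARCH_EXPRESSION, for example `new RegexBasedLocationExtractionStrategy(SearchExpressionBuilder.ForLiteralTerm("SAN JORGE"))`. Whether a term is found still depends on how the text is laid out in the page content, so treat this as a starting point.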

For example, consider the PDF ENaB 20180317.pdf originally shared by the OP of this question:

[Images: ENaB 20180317_Page_1.png, ENaB 20180317_Page_2.png]

There are multiple occurrences of "SAN XXX" with different XXX in those tables. Applying the code above with the regular expression to that file results in:

Page 1
 - 'SAN IGNACIO' at (183.6127, 81.85992), 45.49265×9.500404
 - 'SAN CERNIN' at (260.1665, 203.9058), 42.94177×9.500397
 - 'SAN IGNACIO' at (183.6058, 244.58), 45.49265×9.500397
 - 'SAN CERNIN' at (239.0477, 244.58), 42.94247×9.500397
 - 'SAN JORGE' at (392.0537, 298.8239), 40.27756×9.500397
 - 'SAN CERNIN' at (183.6128, 407.3039), 42.93692×9.500397
 - 'SAN IGNACIO' at (183.6058, 434.42), 45.49265×9.500397
 - 'SAN IGNACIO' at (392.0432, 434.42), 45.48703×9.500397
Page 2
 - 'SAN ADRIAN' at (279.3961, 136.1134), 42.57495×9.500397
 - 'SAN CERNIN' at (183.6475, 149.6715), 42.93692×9.500397
 - 'SAN CERNIN' at (392.0876, 149.6715), 42.94247×9.500397
 - 'SAN CERNIN' at (183.6127, 231.0199), 42.9418×9.500397
 - 'SAN CERNIN' at (392.0528, 231.0199), 42.94247×9.500397
 - 'SAN IGNACIO' at (308.1654, 312.3896), 45.48703×9.500397
 - 'SAN ADRIAN' at (472.0908, 339.5058), 42.57428×9.500397
 - 'SAN CERNIN' at (263.1662, 380.18), 42.94247×9.500397
