
How to find the position (coordinates) of text in a PDF using iText in C#?

I would like to find the position (coordinates) of a line (or paragraph) in a PDF that matches a given pattern.

For example, consider this case:

  • As input, I have a regex (for example "Test.*") and a PDF containing a line (or a paragraph) that matches this regex.
  • As output, I want the list of Y positions of the lines that match this regex.

Does anyone have an idea how I can detect those positions?

Thank you very much.

Eliott

I have something that may help, although it is not complete: I started writing it but never finished. It lets you determine the position of the text. The program walks through every character in the PDF and returns its coordinates.

I use iText 7 and .NET Core.

string[] srcFileNames = { "1.pdf" };
FindTextInPdf("test", srcFileNames);

// Requires the itext7 NuGet package and these namespaces:
// System, System.IO, iText.Kernel.Geom, iText.Kernel.Pdf,
// iText.Kernel.Pdf.Canvas.Parser, iText.Kernel.Pdf.Canvas.Parser.Filter,
// iText.Kernel.Pdf.Canvas.Parser.Listener
public void FindTextInPdf(string searchStr, string[] sources)
{
    foreach (var item in sources)
    {
        if (File.Exists(item))
        {
            using (PdfReader reader = new PdfReader(item))
            using (var doc = new PdfDocument(reader))
            {
                var pageCount = doc.GetNumberOfPages();

                for (int i = 1; i <= pageCount; i++)
                {
                    PdfPage page = doc.GetPage(i);

                    // Only consider text inside the page's crop box.
                    var box = page.GetCropBox();
                    var rect = new Rectangle(box.GetX(), box.GetY(), box.GetWidth(), box.GetHeight());
                    var filter = new IEventFilter[] { new TextRegionEventFilter(rect) };

                    // The result list is static, so clear it before parsing each page.
                    TextLocationStrategy.objectResult.Clear();

                    ITextExtractionStrategy strategy = new FilteredTextEventListener(new TextLocationStrategy(), filter);
                    var str = PdfTextExtractor.GetTextFromPage(page, strategy);
                    if (str.Contains(searchStr))
                    {
                        Console.WriteLine("Searched text found in file:[ " + item + " ] page : [ " + i + " ]");
                    }

                    // Print every character collected on this page with its coordinates.
                    foreach (var d in TextLocationStrategy.objectResult)
                    {
                        Console.WriteLine("Char >" + d.Text + " X >" + d.Rect.GetX() + " Y >" + d.Rect.GetY() + " font >" + d.FontFamily + " font size >" + d.FontSize + " space >" + d.SpaceWidth);
                    }
                }
            }
        }
    }
}


// Requires: System.Collections.Generic, iText.Kernel.Geom,
// iText.Kernel.Pdf.Canvas.Parser, iText.Kernel.Pdf.Canvas.Parser.Data,
// iText.Kernel.Pdf.Canvas.Parser.Listener
class TextLocationStrategy : LocationTextExtractionStrategy
{
    // Static so the caller can read the characters after extraction; not thread-safe.
    public static List<TextMyChunk> objectResult = new List<TextMyChunk>();

    public class TextMyChunk
    {
        public string Text { get; set; }
        public Rectangle Rect { get; set; }
        public string FontFamily { get; set; }
        public float FontSize { get; set; }
        public float SpaceWidth { get; set; }
    }

    public override void EventOccurred(IEventData data, EventType type)
    {
        // Let the base class collect the text as well; without this call
        // GetResultantText() has nothing to return.
        base.EventOccurred(data, type);

        if (type != EventType.RENDER_TEXT) return;

        TextRenderInfo renderInfo = (TextRenderInfo)data;

        // Split the rendered string into single characters.
        foreach (TextRenderInfo t in renderInfo.GetCharacterRenderInfos())
        {
            // Rectangle from the baseline start (lower left) to the ascent line end (upper right).
            Vector letterStart = t.GetBaseline().GetStartPoint();
            Vector letterEnd = t.GetAscentLine().GetEndPoint();
            Rectangle letterRect = new Rectangle(letterStart.Get(0), letterStart.Get(1), letterEnd.Get(0) - letterStart.Get(0), letterEnd.Get(1) - letterStart.Get(1));

            objectResult.Add(new TextMyChunk
            {
                Text = t.GetText(),
                Rect = letterRect,
                FontFamily = t.GetFont().GetFontProgram().ToString(),
                FontSize = t.GetFontSize(),
                SpaceWidth = t.GetSingleSpaceWidth()
            });
        }
    }
}
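
The code above stops at per-character coordinates, while the question asks for the Y positions of the lines that match a regex. One possible way to finish it, as a rough sketch and not a tested iText recipe: group the characters into lines by their Y value, rebuild each line's text from left to right, and run the regex on it. `CharBox`, `LineMatcher`, `FindLineYs` and the `yTolerance` threshold are names and values I made up for illustration; the sketch has no iText dependency and only assumes each character comes with the X and Y of its rectangle.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for one extracted character: its text plus the
// lower-left corner of its rectangle.
public class CharBox
{
    public string Text { get; }
    public float X { get; }
    public float Y { get; }

    public CharBox(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class LineMatcher
{
    // Returns one Y value per line whose text matches the pattern.
    // yTolerance is an assumed threshold for "same line"; tune it per document.
    public static List<float> FindLineYs(IEnumerable<CharBox> chars, string pattern, float yTolerance = 2f)
    {
        var regex = new Regex(pattern);
        var result = new List<float>();
        var line = new List<CharBox>();

        // In PDF user space Y grows upwards, so descending Y reads top to bottom.
        foreach (var c in chars.OrderByDescending(c => c.Y).ThenBy(c => c.X))
        {
            // A drop in Y larger than the tolerance starts a new line.
            if (line.Count > 0 && Math.Abs(line[0].Y - c.Y) > yTolerance)
            {
                AddIfMatch(line, regex, result);
                line = new List<CharBox>();
            }
            line.Add(c);
        }
        AddIfMatch(line, regex, result);
        return result;
    }

    private static void AddIfMatch(List<CharBox> line, Regex regex, List<float> result)
    {
        if (line.Count == 0) return;

        // Re-sort by X because small Y jitter can scramble the order inside a line.
        var ordered = line.OrderBy(c => c.X).ToList();
        string text = string.Concat(ordered.Select(c => c.Text));

        // Report the Y of the leftmost character as the line's position.
        if (regex.IsMatch(text)) result.Add(ordered[0].Y);
    }
}
```

With the strategy above, the input could presumably be built per page with something like `TextLocationStrategy.objectResult.Select(c => new CharBox(c.Text, c.Rect.GetX(), c.Rect.GetY()))`. Two caveats: some PDFs place words without real space characters, so the rebuilt line may lack spaces and the regex should allow for that; and rotated or multi-column text would need a smarter grouping than a plain Y threshold.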
