
How to get the font name of Text from a PDF?

I want to extract the names of all the different fonts used for the text in a PDF file. I am using the iTextSharp DLL, and my code is given below.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace GetFontName
{
    class Program
    {
        static void Main(string[] args)
        {
            PdfReader reader = new PdfReader("C:/Users/agnihotri/Downloads/Test.pdf");
            HashSet<String> names = new HashSet<string>();
            PdfDictionary resources;
            for (int p = 1; p <= reader.NumberOfPages; p++)
            {
                PdfDictionary dic = reader.GetPageN(p);
                resources = dic.GetAsDict(PdfName.RESOURCES);
                if (resources != null)
                {
                    //gets fonts dictionary
                    PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);
                    if (fonts != null)
                    {

                        PdfDictionary font;

                        foreach (PdfName key in fonts.Keys)
                        {
                            font = fonts.GetAsDict(key);
                            string name = font.GetAsName(iTextSharp.text.pdf.PdfName.BASEFONT).ToString();

                            //check for prefix subsetted font
                            if (name.Length > 8 && name.ToCharArray()[7] == '+')
                            {
                                name = String.Format("%s subset (%s)", name.Substring(8), name.Substring(1, 7));
                            }
                            else
                            {
                                //get type of fully embedded fonts
                                name = name.Substring(1);
                                PdfDictionary desc = font.GetAsDict(PdfName.FONTDESCRIPTOR);
                                if (desc == null)
                                    name += "no font descriptor";
                                else if (desc.Get(PdfName.FONTFILE) != null)
                                    name += "(Type1) embedded";
                                else if (desc.Get(PdfName.FONTFILE2) != null)
                                    name += "(TrueType) embedded ";
                                else if (desc.Get(PdfName.FONTFILE3) != null)
                                    name += name;//("+font.GetASName(PdfName.SUBTYPE).ToString().SubSTring(1)+")embedded';
                            }

                            names.Add(name);
                        }
                    }
                }
            }
            var collections = from name in names
                              select name;
            foreach (string fname in collections)
            {
                Console.WriteLine(fname);
            }
            Console.Read();

        }
    }
}

The output I get is "Glyphless Font no font descriptor" for every PDF file I use as input. A link to an input file follows:

https://drive.google.com/open?id=0B6tD8gqVZtLiM3NYMmVVVllNcWc

I opened your PDF in Adobe Acrobat and looked at the font panel. This is what I saw:

[image: screenshot of the font panel in Adobe Acrobat]

You have an embedded subset of LiberationMono. Because the font is subsetted, its name is stored in the file as ABCDEF+LiberationMono, where ABCDEF is a series of 6 random, but unique characters. See "What are the extra characters in the font name of my PDF?"
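
As a rough illustration of that naming convention, the sketch below splits a font name into its subset tag and the actual font name. `SubsetNames` and `SplitSubsetTag` are hypothetical names, not part of iTextSharp. The check assumes the tag is six uppercase letters followed by `+`, and that a leading `/` (as produced by `PdfName.ToString()`) should be dropped first.

```csharp
using System;

static class SubsetNames
{
    // Splits "ABCDEF+FontName" into ("ABCDEF", "FontName").
    // Item1 is null when the name carries no subset tag.
    public static Tuple<string, string> SplitSubsetTag(string name)
    {
        // PdfName.ToString() keeps the leading slash, so drop it if present.
        if (name.Length > 0 && name[0] == '/')
            name = name.Substring(1);

        if (name.Length > 7 && name[6] == '+')
        {
            bool tagIsUpper = true;
            for (int i = 0; i < 6; i++)
            {
                if (name[i] < 'A' || name[i] > 'Z')
                    tagIsUpper = false;
            }
            if (tagIsUpper)
                return Tuple.Create(name.Substring(0, 6), name.Substring(7));
        }
        return Tuple.Create<string, string>(null, name);
    }
}
```

For the name in this file, `SubsetNames.SplitSubsetTag("/BAAAAA+LiberationMono")` yields the tag `BAAAAA` and the font name `LiberationMono`.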

Now let's take a look at the same file opened in iText RUPS:

[image: screenshot of the same file in iText RUPS]

We find the /Font object and it has a /FontDescriptor. In the /FontDescriptor, we find the /FontName in the format we expected: BAAAAA+LiberationMono.

Now that you know where to look for that name, you can adapt your code.
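
One way that adaptation might look is sketched below. It is written against the iTextSharp 5 classes the question already uses and has not been run against the linked file. `FontNameReader`, `Collect` and `AddFontName` are hypothetical names. The extra lookup in /DescendantFonts is an assumption that goes beyond this answer: composite (Type0) fonts keep their descriptor on the descendant font rather than on the top-level font dictionary.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

static class FontNameReader
{
    // Collects the /FontName entries of the font descriptors reachable
    // from the page resources of the PDF at the given path.
    public static HashSet<string> Collect(string path)
    {
        var names = new HashSet<string>();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int p = 1; p <= reader.NumberOfPages; p++)
            {
                PdfDictionary resources = reader.GetPageN(p).GetAsDict(PdfName.RESOURCES);
                PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
                if (fonts == null) continue;

                foreach (PdfName key in fonts.Keys)
                {
                    PdfDictionary font = fonts.GetAsDict(key);
                    if (font == null) continue;
                    AddFontName(font, names);

                    // Assumption: for composite fonts the descriptor sits
                    // on the entries of /DescendantFonts.
                    PdfArray descendants = font.GetAsArray(PdfName.DESCENDANTFONTS);
                    if (descendants == null) continue;
                    for (int i = 0; i < descendants.Size; i++)
                    {
                        PdfDictionary descendant = descendants.GetAsDict(i);
                        if (descendant != null) AddFontName(descendant, names);
                    }
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return names;
    }

    // Adds /FontDescriptor -> /FontName (without the leading slash), if present.
    private static void AddFontName(PdfDictionary font, HashSet<string> names)
    {
        PdfDictionary desc = font.GetAsDict(PdfName.FONTDESCRIPTOR);
        if (desc == null) return;
        PdfName fontName = desc.GetAsName(PdfName.FONTNAME);
        if (fontName != null) names.Add(fontName.ToString().Substring(1));
    }
}
```

Fonts without any descriptor are simply skipped here; whether to fall back to /BaseFont in that case is a design choice left open.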

Running your code with minimal changes I get as output

%s subset (%s)

Actually, %s looks like a Java format string, not a .NET format string. Using the more .NET-style format string {0} subset ({1}), I get

LiberationMono subset (BAAAAA+)
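
The difference between the two format strings can be reproduced in isolation with a small sketch. `SubsetLabel` and its methods are hypothetical helpers introduced only for illustration.

```csharp
using System;

static class SubsetLabel
{
    // Uses .NET composite-format placeholders, which String.Format fills in.
    public static string Describe(string fontName, string subsetTag)
    {
        return String.Format("{0} subset ({1})", fontName, subsetTag);
    }

    // "%s" is not a placeholder for String.Format, so the arguments
    // are ignored and the format string comes back unchanged.
    public static string DescribeWithPercentS(string fontName, string subsetTag)
    {
        return String.Format("%s subset (%s)", fontName, subsetTag);
    }
}
```

Calling `SubsetLabel.Describe("LiberationMono", "BAAAAA+")` gives `LiberationMono subset (BAAAAA+)`, while `DescribeWithPercentS` with the same arguments gives `%s subset (%s)`.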

I would suggest using backslashes and the @"..." string form instead of slashes in a file path, e.g. like this

PdfReader reader = new PdfReader(@"C:\Users\agnihotri\Downloads\Test.pdf");

and double-check the file name and path; after all, the file you provided is named Hello_World.pdf.
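
To make a wrong file name fail early with a clear message, the path could be checked before it is handed to `PdfReader`. This is only a sketch; `PdfPathCheck` and `RequireExistingFile` are hypothetical names.

```csharp
using System;
using System.IO;

static class PdfPathCheck
{
    // Returns the path unchanged if a file exists there,
    // otherwise throws with the offending path attached.
    public static string RequireExistingFile(string path)
    {
        if (!File.Exists(path))
            throw new FileNotFoundException("PDF not found, check the file name and folder", path);
        return path;
    }
}
```

It would be used as `new PdfReader(PdfPathCheck.RequireExistingFile(@"C:\Users\agnihotri\Downloads\Test.pdf"))`.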
