How do I get the font names of text in a PDF?

Question

I want to extract all the distinct font names used by the text in a PDF file. I am using the iTextSharp DLL, and my code is below.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace GetFontName
{
    class Program
    {
        static void Main(string[] args)
        {
            PdfReader reader = new PdfReader("C:/Users/agnihotri/Downloads/Test.pdf");
            HashSet<String> names = new HashSet<string>();
            PdfDictionary resources;
            for (int p = 1; p <= reader.NumberOfPages; p++)
            {
                PdfDictionary dic = reader.GetPageN(p);
                resources = dic.GetAsDict(PdfName.RESOURCES);
                if (resources != null)
                {
                    //gets fonts dictionary
                    PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);
                    if (fonts != null)
                    {

                        PdfDictionary font;

                        foreach (PdfName key in fonts.Keys)
                        {
                            font = fonts.GetAsDict(key);
                            string name = font.GetAsName(iTextSharp.text.pdf.PdfName.BASEFONT).ToString();

                            //check for prefix subsetted font

                            if (name.Length > 8 && name.ToCharArray()[7] == '+')
                            {
                                name = String.Format("%s subset (%s)", name.Substring(8), name.Substring(1, 7));

                            }
                            else
                            {
                                //get type of fully embedded fonts
                                name = name.Substring(1);
                                PdfDictionary desc = font.GetAsDict(PdfName.FONTDESCRIPTOR);
                                if (desc == null)
                                    name += "no font descriptor";
                                else if (desc.Get(PdfName.FONTFILE) != null)
                                    name += "(Type1) embedded";
                                else if (desc.Get(PdfName.FONTFILE2) != null)
                                    name += "(TrueType) embedded ";
                                else if (desc.Get(PdfName.FONTFILE3) != null)
                                    name += name;//("+font.GetASName(PdfName.SUBTYPE).ToString().SubSTring(1)+")embedded';
                            }

                            names.Add(name);
                        }
                    }
                }
            }
            var collections = from name in names
                              select name;
            foreach (string fname in collections)
            {
                Console.WriteLine(fname);
            }
            Console.Read();

        }
    }
}

The output I get for every input PDF file is "Glyphless font" with "no font descriptor". The input files are at the link below:

https://drive.google.com/open?id=0B6tD8gqVZtLiM3NYMmVVVllNcWc

Answer 1

I opened the PDF in Adobe Acrobat and looked at the Fonts panel. This is what I see:

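You have an embedded subset of LiberationMono. Because the font is subsetted, its name is stored in the file as ABCDEF+LiberationMono, where ABCDEF is a sequence of 6 random but unique characters. See "What are the extra characters in the font name of my PDF?"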
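To make the prefix convention concrete, here is a minimal sketch that splits a stored font name into its subset tag and the real font name. The class and method names (`SubsetPrefix`, `Split`) are hypothetical and not part of iTextSharp. The check for six uppercase letters follows the PDF specification's description of subset tags; the input is assumed to have no leading slash.

```csharp
using System;

static class SubsetPrefix
{
    // Splits "BAAAAA+LiberationMono" into ("BAAAAA", "LiberationMono").
    // Item1 is null when the name carries no subset tag.
    // Hypothetical helper for illustration, not an iTextSharp API.
    public static Tuple<string, string> Split(string fontName)
    {
        // A subset tag is exactly six uppercase letters followed by '+'.
        if (fontName != null && fontName.Length > 7 && fontName[6] == '+')
        {
            bool allUpper = true;
            for (int i = 0; i < 6; i++)
            {
                if (fontName[i] < 'A' || fontName[i] > 'Z')
                {
                    allUpper = false;
                    break;
                }
            }
            if (allUpper)
            {
                return Tuple.Create(fontName.Substring(0, 6), fontName.Substring(7));
            }
        }
        return Tuple.Create<string, string>(null, fontName);
    }
}
```

Calling `SubsetPrefix.Split("BAAAAA+LiberationMono")` yields the tag `BAAAAA` and the name `LiberationMono`, while a name such as `Helvetica` comes back unchanged with a null tag.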

Now let's look at the same file opened in iText RUPS:

We find the /Font object, and it has a /FontDescriptor. In the /FontDescriptor, we find the /FontName in the expected format: BAAAAA+LiberationMono.

Now that you know where to look for the name, you can adapt your code.
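One way to adapt the code is sketched below, assuming iTextSharp 5. The helper name `GetDescriptorFontName` and its class are hypothetical. The fallback to /DescendantFonts is an assumption added here for composite (Type0) fonts, whose descriptor sits on the descendant font rather than on the top-level font dictionary; the answer above does not say whether that applies to this particular file.

```csharp
using iTextSharp.text.pdf;

static class FontNameLookup
{
    // Returns the /FontName stored in the font's /FontDescriptor,
    // or null when no descriptor or name can be found.
    // Hypothetical helper for illustration.
    public static string GetDescriptorFontName(PdfDictionary font)
    {
        PdfDictionary desc = font.GetAsDict(PdfName.FONTDESCRIPTOR);
        if (desc == null)
        {
            // Assumption: for composite (Type0) fonts, look at the first
            // descendant font, which carries the descriptor.
            PdfArray descendants = font.GetAsArray(PdfName.DESCENDANTFONTS);
            if (descendants != null && descendants.Size > 0)
            {
                PdfDictionary descendant = descendants.GetAsDict(0);
                if (descendant != null)
                {
                    desc = descendant.GetAsDict(PdfName.FONTDESCRIPTOR);
                }
            }
        }
        if (desc == null)
        {
            return null;
        }
        PdfName fontName = desc.GetAsName(PdfName.FONTNAME);
        // DecodeName drops the leading '/' and resolves #xx escapes.
        return fontName == null ? null : PdfName.DecodeName(fontName.ToString());
    }
}
```

In the question's loop, `FontNameLookup.GetDescriptorFontName(font)` could stand in for the /BaseFont lookup. Note that the returned string has no leading slash, so the `+` of a subset prefix sits at index 6 rather than 7.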

Answer 2

Running your code with minimal changes, I get the output

%s subset (%s)

Indeed, %s looks like a Java format string, not a .NET one. Using the more .NET'ish format string {0} subset ({1}), I get

LiberationMono subset (BAAAAA+)
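The corrected formatting step can be sketched in isolation as follows. The class and method names (`SubsetLabel`, `Describe`) are hypothetical; the index arithmetic is taken from the question's code, which works on the `PdfName.ToString()` output including its leading slash.

```csharp
using System;

static class SubsetLabel
{
    // Formats a raw name such as "/BAAAAA+LiberationMono" the way the
    // question's code intends. Hypothetical helper for illustration.
    public static string Describe(string rawName)
    {
        // rawName keeps the leading '/', so the '+' of a subset prefix
        // is at index 7 and the real font name starts at index 8.
        if (rawName.Length > 8 && rawName[7] == '+')
        {
            // .NET uses indexed placeholders; "%s" would be printed literally.
            return String.Format("{0} subset ({1})",
                rawName.Substring(8), rawName.Substring(1, 7));
        }
        // No subset prefix: just drop the leading '/'.
        return rawName.Substring(1);
    }
}
```

With this, `SubsetLabel.Describe("/BAAAAA+LiberationMono")` returns `LiberationMono subset (BAAAAA+)`, matching the output shown above.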

I suggest you use backslashes and the @"..." string form in the file path instead of forward slashes, for example:

PdfReader reader = new PdfReader(@"C:\Users\agnihotri\Downloads\Test.pdf");

Also double-check the file name and path: all the files at the location you shared are named Hello_World.pdf.

2 answers

Answer 1
2 2016-06-14 14:13:23

Answer 2
2 2016-06-14 14:30:21
