Font information of text in PDF using PDFBox

I am new to the Apache PDFBox library.

I want to map font information to the paragraphs of a PDF.

I have already gone through the question "How to extract font styles of text contents using pdfbox?"

But it doesn't tell me which paragraph is written in which font.

For example, if my page contains the text:

para1: Arial

para2: Times New Roman

Then I should be able to find out that para1 is written in Arial while para2 is written in Times New Roman.

The solution proposed in the above question only reports that the PDF page contains Arial and Times New Roman.

The PDFTextStripper class you use is documented (cf. its JavaDoc comment) like this:

* This class will take a pdf document and strip out all of the text and ignore the
* formatting and such.

To get specific font information, therefore, you have to change it somewhat.

The font information is available in this class all along and is only discarded when outputting a line; have a look at its source:

protected void writePage() throws IOException
{
    [...]
    for( int i = 0; i < charactersByArticle.size(); i++)
    {
        [...]
        List<TextPosition> line = new ArrayList<TextPosition>();
        [...]
        while( textIter.hasNext() )
        {
            [...]
            if( lastPosition != null )
            {
                [...]
                if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
                {
                    writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                    line.clear();
                    [...]
                }
                [...]

The TextPosition instances in that list line still have all formatting information available, among them the font used; only while "normalizing" the line is it reduced to pure characters.
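To make concrete what is lost in that step, here is a minimal, self-contained sketch that splits one line into runs of consecutive characters sharing a font. It does not use PDFBox: Glyph is a hypothetical stand-in that models only the character and font name a TextPosition carries, and FontRuns.split is an illustrative helper, not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a TextPosition: only the character and its font name are modeled. */
final class Glyph {
    final String text;
    final String fontName;

    Glyph(String text, String fontName) {
        this.text = text;
        this.fontName = fontName;
    }
}

final class FontRuns {
    /** Splits one line into runs of consecutive glyphs sharing a font, formatted as "font: text". */
    static List<String> split(List<Glyph> line) {
        List<String> runs = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        String currentFont = null;
        for (Glyph g : line) {
            // A font change closes the current run and starts a new one.
            if (currentFont != null && !currentFont.equals(g.fontName)) {
                runs.add(currentFont + ": " + current);
                current.setLength(0);
            }
            currentFont = g.fontName;
            current.append(g.text);
        }
        // Flush the last run, if the line was not empty.
        if (currentFont != null) {
            runs.add(currentFont + ": " + current);
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("H", "Arial"), new Glyph("i", "Arial"),
                new Glyph("!", "Times New Roman"));
        System.out.println(split(line)); // prints [Arial: Hi, Times New Roman: !]
    }
}
```

With real PDFBox objects the font would come from each TextPosition (via its getFont() method); how the font's name is then read depends on the PDFBox version in use.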

To keep font information, therefore, you have different options, depending on how you want to retrieve the font information:

  • If you want to continue retrieving all page content information (including fonts) in a single String via getText: change the method

     private List<String> normalize(List<TextPosition> line, boolean isRtlDominant, boolean hasRtl) 

    to include some font tags of your choice (e.g. [Arial]) whenever the font changes. Unfortunately this method is private. Thus, you have to copy the whole PDFTextStripper class and change the code of the copy.

  • If you want to retrieve the specific font information in a different structure (e.g. as List<List<TextPosition>>), you can derive your own stripper class from PDFTextStripper, add a variable of your desired type, and override the protected method writePage mentioned above, copying it and only enhancing it right before or after the line

     writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant); 

    with code adding the information to your new variable. E.g.:

     public class MyPDFTextStripper extends PDFTextStripper
     {
         public List<List<TextPosition>> myLines = new ArrayList<List<TextPosition>>();
         [...]
                 if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
                 {
                     writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                     myLines.add(new ArrayList<TextPosition>(line));
                     line.clear();
                     [...]
                 }

    Now you can call getText for an instance of your MyPDFTextStripper, retrieve the plain text as the result, and access the additional data via the new variable.
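Once the lines have been collected in a structure like myLines, mapping each line to its fonts is a simple pass over the data. The following is only a sketch under assumptions: PositionedChar is a hypothetical stand-in for TextPosition (just a character and a font name), and LineFontReport, describe and lineOf are illustrative names that do not exist in PDFBox.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a TextPosition: only the character and font name are modeled. */
final class PositionedChar {
    final String text;
    final String fontName;

    PositionedChar(String text, String fontName) {
        this.text = text;
        this.fontName = fontName;
    }
}

final class LineFontReport {
    /** Returns one entry per line: the line's text, then its distinct fonts in order of first use. */
    static List<String> describe(List<List<PositionedChar>> lines) {
        List<String> report = new ArrayList<String>();
        for (List<PositionedChar> line : lines) {
            StringBuilder text = new StringBuilder();
            Set<String> fonts = new LinkedHashSet<String>();
            for (PositionedChar c : line) {
                text.append(c.text);
                fonts.add(c.fontName);
            }
            report.add(text + " -> " + fonts);
        }
        return report;
    }

    /** Demo helper: builds a line in which every character uses the same font. */
    static List<PositionedChar> lineOf(String text, String fontName) {
        List<PositionedChar> line = new ArrayList<PositionedChar>();
        for (int i = 0; i < text.length(); i++) {
            line.add(new PositionedChar(String.valueOf(text.charAt(i)), fontName));
        }
        return line;
    }

    public static void main(String[] args) {
        // Stands in for the myLines variable filled by the custom stripper.
        List<List<PositionedChar>> myLines = new ArrayList<List<PositionedChar>>();
        myLines.add(lineOf("para1", "Arial"));
        myLines.add(lineOf("para2", "Times New Roman"));
        for (String entry : describe(myLines)) {
            System.out.println(entry);
        }
        // prints:
        // para1 -> [Arial]
        // para2 -> [Times New Roman]
    }
}
```

A line that mixes fonts simply lists all of them, so consecutive lines with the same font set could be merged afterwards if paragraph-level information is wanted.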

To use fonts other than the ones that ship with the library, you need to add the font file explicitly.
