
PDFBox: getting the meaning of the abbreviations in page content

I'm having problems with PDFBox, the Java library. I'm trying to work on the structure of PDFs, and to do that without losing information I'm using PDPage.getContents() instead of a text stripper.

The problem is that it displays the content with a lot of abbreviations and numbers for which I could not find an explanation on the website.

An example:

BT
0.001 Tc
1.2045 TL
9.9626 0 0 9.9626 53.04069 571.90505 Tm
[(con)26.6(t)4.4(aining)-378.3(their)-378.2(a)-4.9(sso)-29(ciated)-358.9(eigen)26.6(v)59(alues)] TJ
ET
BT
0 Tc
0 TL
/F8 1 Tf
9.9626 0 0 9.9626 226.08209 571.90505 Tm
[(\012)] TJ
ET
BT
/F11 1 Tf
6.9738 0 0 6.9738 231.84 570.465 Tm
[(d)] TJ
ET
BT
0.0002 Tc
/F5 1 Tf
9.9626 0 0 9.9626 236.64 571.905 Tm
[(,)-372.5(i)0.9(n)-383.8(d)1.7(escending)-379.1(o)-5.7(r)-5.6(der)-5.6(.)-360.4(Beca)-5.7(use)-362.4(t)3.6(he)] TJ
ET
BT
-0.0008 Tc
1.2045 TL
9.9626 0 0 9.9626 53.04024 559.90505 Tm
[(co)17.4(v)57.2(a)-6.7(r)-6.6(i)-0.1(a)-6.7(n)0.7(ce)-267(ma)-6.7(tr)-6.6(ix)-280(is)-280.9(symmetr)-6.6(ic)-279.1(a)-6.7(n)0.7(d)-288.4(s)-3.8(emip)-23.4(o)-6.7(s)-3.8(itiv)21.1(e)-279.1(d)0.7(e“nite,)-289.1(t)2.6(he)-291.1(eig)-6.7(e)-2(n)24.8(v)21.1(ecto)-6.7(r)-6.6(s)-256.8(a)-6.7(r)-6.6(e)] TJ
ET

I was able to translate some of the obvious ones (ET = end text, BT = begin text), but I can't be sure about basically anything else. The numbers next to the "syllables" seem to have something to do with position.

Of particular interest to me are the /F5, /F7, ... entries. They seem to relate to the formatting of the text that comes after them, but knowing only that doesn't really help with general PDF analysis; I need a bit more information.

I will gladly accept any piece of information that might be of use. Thank you in advance :)

The best place to start is Annex A, "Operator Summary" (on the left), in the PDF 32000 specification, or page 645. In the beginning, I used it all the time.
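To give an idea of what that lookup amounts to for the operators in your example, here is a rough sketch in plain Java. It is not PDFBox code: the class OperatorGlossary and its methods are made up for illustration, the meanings are paraphrased from the operator summary, and it assumes one operator per line with the operands first, which is how your dump happens to be laid out (a real content stream need not be).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: annotates content-stream lines
// with the meaning of the operator that ends each line.
class OperatorGlossary {

    // Meanings paraphrased from the operator summary; only the operators
    // that occur in the question's example are listed.
    private static final Map<String, String> MEANINGS = new HashMap<>();
    static {
        MEANINGS.put("BT", "begin text object");
        MEANINGS.put("ET", "end text object");
        MEANINGS.put("Tc", "set character spacing");
        MEANINGS.put("TL", "set text leading");
        MEANINGS.put("Tf", "select font and font size");
        MEANINGS.put("Tm", "set text matrix");
        MEANINGS.put("TJ", "show text, allowing individual glyph positioning");
    }

    // PDF uses postfix notation: operands first, operator last.
    // Assumption: the line holds exactly one operator.
    static String operatorOf(String line) {
        String trimmed = line.trim();
        int cut = Math.max(trimmed.lastIndexOf(' '), trimmed.lastIndexOf(']'));
        return trimmed.substring(cut + 1);
    }

    static String describe(String operator) {
        return MEANINGS.getOrDefault(operator, "not in this small table, see Annex A");
    }

    // Appends the meaning after '%', which starts a comment in PDF syntax.
    static List<String> annotate(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(line + "   % " + describe(operatorOf(line)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "BT", "0 Tc", "0 TL", "/F8 1 Tf",
                "9.9626 0 0 9.9626 226.08209 571.90505 Tm",
                "[(\\012)] TJ", "ET");
        for (String annotated : annotate(sample)) {
            System.out.println(annotated);
        }
    }
}
```

Anything the table does not know is reported as such rather than guessed, so the spec stays the reference for the full operator list.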

In your example, "Tf" is "select font". To find out what the font is, look up the name in the resource dictionary with PDFDebugger, or hover the mouse cursor over "Tf" and wait for the font name to be displayed. Here's an example:

[image]

So /TT2 is a subset of the Verdana,Bold font.
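As for the numbers next to the "syllables": in a TJ array they are position adjustments, expressed in thousandths of a unit of text space and subtracted from the current horizontal position, so a large negative number widens the gap before the next string. The sketch below, again plain Java and not PDFBox, joins the strings of one TJ array and treats any adjustment below -200 as a word gap. The class name TjArrayDecoder and the -200 threshold are my own assumptions for illustration, and the parser is deliberately naive: it ignores nested parentheses, does not decode octal escapes such as \012, and assumes the string bytes are readable as text, which depends on the font's encoding.

```java
// Hypothetical helper, not part of PDFBox: turns one TJ array into text.
class TjArrayDecoder {

    // Assumption: an adjustment below this value (in thousandths of a
    // text-space unit) is treated as a space between words.
    static final double WORD_GAP = -200.0;

    static String decode(String tjArray) {
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < tjArray.length()) {
            char c = tjArray.charAt(i);
            if (c == '(') {
                // Literal string: copy characters up to the closing ')'.
                i++;
                while (i < tjArray.length() && tjArray.charAt(i) != ')') {
                    if (tjArray.charAt(i) == '\\' && i + 1 < tjArray.length()) {
                        i++; // copy the escaped character literally (octal escapes are not decoded)
                    }
                    text.append(tjArray.charAt(i));
                    i++;
                }
                i++; // skip ')'
            } else if (c == '-' || c == '.' || Character.isDigit(c)) {
                // Number between strings: a position adjustment.
                StringBuilder number = new StringBuilder();
                while (i < tjArray.length()
                        && (tjArray.charAt(i) == '-' || tjArray.charAt(i) == '.'
                            || Character.isDigit(tjArray.charAt(i)))) {
                    number.append(tjArray.charAt(i));
                    i++;
                }
                if (Double.parseDouble(number.toString()) < WORD_GAP) {
                    text.append(' ');
                }
            } else {
                i++; // brackets and whitespace
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String first = "[(con)26.6(t)4.4(aining)-378.3(their)-378.2(a)-4.9(sso)-29(ciated)"
                + "-358.9(eigen)26.6(v)59(alues)]";
        // prints: containing their associated eigenvalues
        System.out.println(decode(first));
    }
}
```

The small values (26.6, -4.9, 59 and so on) are kerning between glyphs, which is why they are ignored here and only the large negative ones become spaces.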
