Using the Java PDFBox library to write a Russian PDF

I am using a Java library called PDFBox to write text to a PDF. It works perfectly for English text, but when I tried to write Russian text into the PDF, the letters came out garbled. The problem seems to be the font used, but I am not sure about that, so I hope someone can guide me through this. Here are the important lines of code:

PDTrueTypeFont font = PDTrueTypeFont.loadTTF( pdfFile, new File( "fonts/VREMACCI.TTF" ) );  // Windows Russian font imported to write the Russian text.
font.setEncoding( new WinAnsiEncoding() );  // Define the Encoding used in writing.
// Some code here to open the PDF & define a new page.
contentStream.drawString( "отделом компьютерной" ); // Write the Russian text.

The WinAnsiEncoding source code is: Click here

--------------------- Edit on 18 November 2009 ---------------------

After some investigation, I am now sure it is an encoding problem. It could be solved by defining my own encoding using the helpful PDFBox class called DictionaryEncoding.

I am not sure how to use it, but here is what I have tried so far:

COSDictionary cosDic = new COSDictionary();
cosDic.setString( COSName.getPDFName("Ercyrillic"), "0420 " ); // Russian letter.
font.setEncoding( new DictionaryEncoding( cosDic ) );

This does not work; it seems I am filling the dictionary the wrong way. When I write a PDF page using this, it appears blank.

The DictionaryEncoding source code is: Click here

The long story is this - in order to do Unicode output in PDF from a TrueType font, the output must include a ton of detailed and seemingly superfluous information. What it comes down to is this - inside a TrueType font the glyphs are stored as glyph IDs. These glyph IDs are associated with a particular Unicode character (and IIRC, a Unicode glyph internally may refer to several code points - like é referring to e and an acute accent - my memory is hazy). PDF doesn't really have Unicode support other than to say that there exists a mapping from UTF16BE values in a string to glyph IDs in a TrueType font as well as a mapping from UTF16BE values to Unicode - even if it's the identity.

  • a Font dictionary of Subtype Type0 with
    • a DescendantFonts array with an entry described below
    • a ToUnicode entry that maps UTF16BE values to Unicode
    • an Encoding set to Identity-H

Output from one of my unit tests on my own tools looks like this:

13 0 obj
<< 
   /BaseFont /DejaVuSansCondensed 
   /DescendantFonts [ 4 0 R  ]   
   /ToUnicode 14 0 R 
   /Type /Font 
   /Subtype /Type0 
   /Encoding /Identity-H 
>> endobj

14 0 obj
<< /Length 346 >> stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def /CMapName /Adobe-Identity-UCS
def /CMapType 2 def 1 begincodespacerange <0000> <FFFF> endcodespacerange 1
beginbfrange <0000> <FFFF> <0000> endbfrange endcmap CMapName currentdict /CMap
defineresource pop end end

endstream % note that the formatting is wrong for the stream
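To make the string side of this concrete: under the scheme described above, where the codes in a text string are UTF16BE values, the operand of a text-showing operator can be written as a PDF hex string. The following is only a minimal sketch of that formatting step. The class and method names (`Utf16HexString`, `toPdfHex`) are made up for illustration and are not part of PDFBox or any other library, and it assumes the text stays within the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: format a Java string as a PDF hex string of UTF-16BE code units. */
final class Utf16HexString {

    /** Returns, for example, "<00480069>" for "Hi": two bytes per UTF-16 code unit. */
    static String toPdfHex(String text) {
        // UTF_16BE writes big-endian code units without a byte order mark.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder(bytes.length * 2 + 2);
        sb.append('<');
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        // The Russian word "dlya" written with Unicode escapes, so the example
        // does not depend on the encoding of the source file.
        System.out.println(toPdfHex("\u0434\u043B\u044F") + " Tj");
    }
}
```

Note that this only holds when the font objects are set up so that character codes equal UTF16BE values, as in the objects shown here; a generator that uses glyph IDs as codes would need a different mapping.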

  • a Font dictionary of Subtype CIDFontType2 with
    • a CIDSystemInfo
    • a FontDescriptor
    • DW and W
    • a CIDToGIDMap that maps from character ID to glyph ID

Here's the one from the same test - this is the object in the DescendantFonts array:

4 0 obj
<< 
   /Subtype /CIDFontType2 
   /Type /Font 
   /BaseFont /DejaVuSansCondensed 
   /CIDSystemInfo 8 0 R 
   /FontDescriptor 9 0 R 
   /DW 1000 
   /W 10 0 R 
   /CIDToGIDMap 11 0 R 
>>

8 0 obj
<< 
   /Registry (Adobe)
   /Ordering (UCS)
   /Supplement 0 
>>
endobj
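For the CIDToGIDMap entry, the stream data is a flat table with two bytes per CID, high-order byte first, giving the glyph index for that CID. Here is a rough sketch of building that table in Java. The class name `CidToGidMapSketch` and the glyph IDs in `main` are invented for illustration; real glyph IDs would have to be read from the font's own cmap table.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: build the bytes of a CIDToGIDMap stream (2 bytes per CID, big-endian). */
final class CidToGidMapSketch {

    static byte[] build(Map<Integer, Integer> cidToGid) {
        int maxCid = 0;
        for (Map.Entry<Integer, Integer> e : cidToGid.entrySet()) {
            int cid = e.getKey();
            int gid = e.getValue();
            if (cid < 0 || cid > 0xFFFF || gid < 0 || gid > 0xFFFF) {
                throw new IllegalArgumentException("CID and GID must fit in 16 bits");
            }
            maxCid = Math.max(maxCid, cid);
        }
        // The array is zero-filled, so unmapped CIDs point at glyph 0 (.notdef).
        byte[] table = new byte[(maxCid + 1) * 2];
        for (Map.Entry<Integer, Integer> e : cidToGid.entrySet()) {
            int cid = e.getKey();
            int gid = e.getValue();
            table[cid * 2] = (byte) (gid >> 8);  // high-order byte
            table[cid * 2 + 1] = (byte) gid;     // low-order byte
        }
        return table;
    }

    public static void main(String[] args) {
        // Hypothetical glyph IDs, for illustration only.
        Map<Integer, Integer> map = new TreeMap<>();
        map.put(0x0434, 573);
        map.put(0x043B, 580);
        map.put(0x044F, 600);
        System.out.println("table length in bytes: " + build(map).length);
    }
}
```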

Why am I telling you this? What does it have to do with PDFBox? Just this: Unicode output in PDF is, frankly, a royal pain in the butt. Acrobat was developed before there was Unicode and it was painful from the start to have CJK encodings without Unicode (I know - I worked on Acrobat then). Later Unicode support was added, but it really felt like it was glommed on. One would hope that you would just say /Encoding /Unicode and have strings that start with the thorn and y-dieresis characters and off you go. No such luck. If you don't put in every detailed thing (and really, Acrobat, embedding a PostScript program to translate to Unicode? WTH?), you get a blank page in Acrobat. I swear, I am not making this up.

At this point, I write PDF generation tools for a separate company (.NET right now, so it won't help you), and I made it a design goal to hide all that nonsense. All text is Unicode - if you only use those character codes that are the same as WinAnsi, that's what you get under the hood. Use anything else, and you get all this other stuff with it. I'd be surprised if PDFBox does that work for you - it is a serious hassle.
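The "WinAnsi if it fits, otherwise the full Type0 setup" decision can be approximated in Java with a charset encoder. This is only a sketch of the idea, not how my tools or PDFBox do it: it assumes Java's windows-1252 charset is a close enough stand-in for the PDF WinAnsiEncoding (the two are very similar but not defined the same way), and the class name `WinAnsiCheck` is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Sketch: decide whether text could be written with a simple WinAnsi-encoded font. */
final class WinAnsiCheck {

    // Assumption: windows-1252 approximates the PDF WinAnsiEncoding.
    private static final Charset WIN_ANSI_APPROX = Charset.forName("windows-1252");

    /** True if every character of the text has a windows-1252 code. */
    static boolean fitsWinAnsi(String text) {
        CharsetEncoder encoder = WIN_ANSI_APPROX.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsWinAnsi("Hello"));
        // Cyrillic text, written with Unicode escapes to stay source-encoding neutral.
        System.out.println(fitsWinAnsi("\u043E\u0442\u0434\u0435\u043B\u043E\u043C"));
    }
}
```

Cyrillic text fails this check, which is why setting WinAnsiEncoding on the font cannot produce Russian letters no matter which TTF file is loaded.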

The solution is very simple.

1) You must find fonts compatible with the characters you want to display.
2) Download the .ttf file of the fonts locally.
3) Load the fonts from your application.

For example, this is what you have to do if you want to use Greek characters:

content = new PDPageContentStream(document, page);
pdfFont = PDType0Font.load( document, new File( "arialuni.ttf" ) );
content.setFont(pdfFont, fontSize);

Perhaps a Russian encoding class needs to be written; I suppose it should look like the WinAnsiEncoding one.
Now, I have no idea what to put there!

Or, if that's not what you do already, perhaps you should encode your source file in UTF-8 and use a default encoding.
I saw some messages related to issues with extracting Russian text from existing PDF files (using PDFBox of course), but I don't know if output is related.
You can also write to the PDFBox mailing list.
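One way to rule out the source-file encoding as a cause is to write the literal with Unicode escapes and dump the code points the compiler actually produced. A small sketch, with made-up names (`CodePointDump`, `describe`):

```java
import java.util.stream.Collectors;

/** Sketch: inspect which code points a string literal really contains. */
final class CodePointDump {

    // The first word of the sample text from the question, written with
    // Unicode escapes so it does not depend on how the source file is encoded.
    static final String TEXT = "\u043E\u0442\u0434\u0435\u043B\u043E\u043C";

    /** Formats each code point as U+XXXX, separated by spaces. */
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(describe(TEXT));
    }
}
```

If a literal typed directly in Cyrillic does not `equals()` the escaped form, the compiler read the source file with the wrong encoding, and the problem starts before PDFBox is ever involved.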

Testing whether this is an encoding issue should be pretty easy to do (just switch to UTF16 encoding).

I'm assuming that you've tried the VREMACCI font in an editor or something similar and confirmed that it displays the way you expect it to?

You might want to try doing the same thing in iText just to get a feel for whether the issue is related to the PDFBox library itself... If your primary goal is to generate PDF files, iText might be a better solution anyway.

EDIT - long answer to comments:

OK - sorry for the back and forth on the encoding question... Your core issue (which you probably already knew) is that the encoding of the bytes being written to the content stream is different from the encoding being used to look up glyphs. Now I'll try to actually be helpful:
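As an analogy for that mismatch (using plain Java charsets rather than PDF fonts), here is what happens when bytes written under one encoding are interpreted under another. The class name `EncodingMismatch` is just for illustration, and the sketch assumes the JVM provides the windows-1251 charset, which standard JDK builds do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Sketch: bytes written with one encoding and interpreted with another. */
final class EncodingMismatch {

    /** Encodes text with one charset, then decodes the same bytes with another. */
    static String misread(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        // The Russian word "dlya" as Unicode escapes.
        String russian = "\u0434\u043B\u044F";
        String garbled = misread(russian,
                Charset.forName("windows-1251"),
                StandardCharsets.ISO_8859_1);
        // The three Cyrillic letters come back as accented Latin letters.
        for (int i = 0; i < garbled.length(); i++) {
            System.out.printf("U+%04X%n", (int) garbled.charAt(i));
        }
    }
}
```

The Cyrillic bytes 0xE4 0xEB 0xFF are perfectly valid, they just select different characters (ä, ë, ÿ) in the other table. The PDF case is the same idea, except that the second table is the font's encoding used to pick glyphs.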

I took a look at the dictionary encoding class in PDFBox, and it looks quite unintuitive... The 'dictionary' in question is a PDF dictionary. So what you'll basically need to do is create a PDF dictionary object (I think that PDFBox calls this a type of COSObject), then add entries to it.

The encoding for a font is defined in PDF as a dictionary (see page 266 of the above spec). The dictionary contains a base encoding name, plus an optional differences array. Technically, the differences array should not be used with TrueType fonts (although I've seen it used in some cases - don't use it, though).

You will then specify an entry for the cmap for the encoding. This cmap will be the encoding of your font.

My suggestion here is to take an existing PDF that does what you want, then get a dump of the dictionary structure for the font so you can see what it looks like.

This is definitely not for the faint of heart. I can provide some help - if you need a dictionary dump, shoot me a hyperlink with a sample PDF and I'll run it through some of the algorithms I use in my iText development (I'm the maintainer of the iText text extraction sub-system).

EDIT - 11/17/09

OK - here's the dictionary dump from the russian.pdf file (sub-dictionaries are listed indented, and in the order they appeared in the containing dictionary):

(/CropBox=[0, 0, 595, 842], /Parent=Dictionary of type: /Pages, /Type=/Page, /Contents=[209 0 R, 210 0 R, 211 0 R, 214 0 R, 215 0 R, 216 0 R, 222 0 R, 223 0 R], /Resources=Dictionary, /MediaBox=[0, 0, 595, 842], /StructParents=0, /Rotate=0)
    Subdictionary /Parent = (/Type=/Pages, /Count=6, /Kids=[195 0 R, 1 0 R, 3 0 R, 5 0 R, 7 0 R, 9 0 R])
    Subdictionary /Resources = (/ExtGState=Dictionary, /ProcSet=[/PDF, /Text], /ColorSpace=Dictionary, /Font=Dictionary, /Properties=Dictionary)
        Subdictionary /ExtGState = (/GS0=Dictionary of type: /ExtGState)
            Subdictionary /GS0 = (/OPM=1, /op=false, /Type=/ExtGState, /SA=false, /OP=false, /SM=0.02)
        Subdictionary /ColorSpace = (/CS0=[/ICCBased, 228 0 R])
        Subdictionary /Font = (/C2_1=Dictionary of type: /Font, /C2_2=Dictionary of type: /Font, /C2_3=Dictionary of type: /Font, /C2_4=Dictionary of type: /Font, /TT2=Dictionary of type: /Font, /TT1=Dictionary of type: /Font, /TT0=Dictionary of type: /Font, /C2_0=Dictionary of type: /Font, /TT3=Dictionary of type: /Font)
            Subdictionary /C2_1 = (/DescendantFonts=[243 0 R], /BaseFont=/LDMIEC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /C2_2 = (/DescendantFonts=[233 0 R], /BaseFont=/LDMIBO+TimesNewRomanPSMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /C2_3 = (/DescendantFonts=[224 0 R], /BaseFont=/LDMIHD+TimesNewRomanPS-ItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /C2_4 = (/DescendantFonts=[229 0 R], /BaseFont=/LDMIDA+Tahoma, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /TT2 = (/LastChar=58, /BaseFont=/LDMIFC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 333], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
                Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=136, /Descent=-216, /FontWeight=700, /FontBBox=[-558, -307, 2000, 1026], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMIFC+TimesNewRomanPS-BoldMT, /Ascent=891, /ItalicAngle=0)
            Subdictionary /TT1 = (/LastChar=187, /BaseFont=/LDMICP+TimesNewRomanPSMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 833, 778, 0, 333, 333, 0, 0, 250, 333, 250, 278, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 278, 278, 0, 564, 0, 444, 0, 722, 667, 667, 722, 611, 556, 0, 722, 333, 389, 0, 611, 889, 722, 722, 556, 0, 667, 556, 611, 0, 722, 944, 0, 722, 0, 333, 0, 333, 0, 500, 0, 444, 500, 444, 500, 444, 333, 500, 500, 278, 0, 500, 278, 778, 500, 500, 500, 0, 333, 389, 278, 500, 500, 722, 0, 500, 444, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
                Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=82, /Descent=-216, /FontWeight=400, /FontBBox=[-568, -307, 2000, 1007], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMICP+TimesNewRomanPSMT, /Ascent=891, /ItalicAngle=0)
            Subdictionary /TT0 = (/LastChar=55, /BaseFont=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 500, 500, 500, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
                Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=116.867004, /Descent=-216, /FontWeight=700, /FontBBox=[-547, -307, 1206, 1032], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=98, /XHeight=468, /FontFamily=Times New Roman, /FontName=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Ascent=891, /ItalicAngle=-15)
            Subdictionary /C2_0 = (/DescendantFonts=[238 0 R], /BaseFont=/LDMHPN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /TT3 = (/LastChar=169, /BaseFont=/LDMIEB+Tahoma, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[313, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 546, 0, 546, 0, 0, 546, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 929], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
                Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=92, /Descent=-206, /FontWeight=400, /FontBBox=[-600, -208, 1338, 1034], /CapHeight=734, /FontFile2=Stream, /FontStretch=/Normal, /Flags=32, /XHeight=546, /FontFamily=Tahoma, /FontName=/LDMIEB+Tahoma, /Ascent=1000, /ItalicAngle=0)
        Subdictionary /Properties = (/MC0=Dictionary of type: /OCMD)
            Subdictionary /MC0 = (/Type=/OCMD, /OCGs=Dictionary of type: /OCG)
                Subdictionary /OCGs = (/Usage=Dictionary, /Type=/OCG, /Name=HeaderFooter)
                    Subdictionary /Usage = (/CreatorInfo=Dictionary, /PageElement=Dictionary)
                        Subdictionary /CreatorInfo = (/Creator=Acrobat PDFMaker 6.0 äëÿ Word)
                        Subdictionary /PageElement = (/SubType=/HF)

There are a lot of moving parts here. You might want to put together a test document that has only 3 or 4 characters in the font in question... There are a lot of Type0 fonts being used here (in addition to the TT fonts), so it's hard to tell what is involved in your particular issue.

(Are you sure you don't want to at least try this with iText? ;-) I'm not saying that it'll work, just that it might be worth a shot.)

For reference, the above dictionary dump was obtained using the com.lowagie.text.pdf.parser.PdfContentReaderTool class.

Try to use this construction:

PDFont font = PDType0Font.load( pdfFile, new File( "fonts/VREMACCI.TTF" ) );  // Windows Russian font imported to write the Russian text.
// Some code here to open the PDF & define a new page.
contentStream.beginText();
contentStream.setFont(font, 12);
contentStream.showText( "отделом компьютерной" ); // Write the Russian text.
contentStream.endText();

Just try this one:

Phrase leftTitle = new Phrase("САНКТ-ПЕТЕРБУРГ", FontFactory.getFont("Tahoma", "Cp1251", true, 25));

This will work at least with the latest (5.0.1) iText.
