Using the Java PDFBox library to write a Russian PDF

Question

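I am using a Java library called PDFBox to try to write text to a PDF. It works fine for English text, but when I try to write Russian text into the PDF, the letters come out strange. The problem seems to be the font being used, but I'm not sure, so I hope someone can guide me through this. Here are the important lines of code: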

PDTrueTypeFont font = PDTrueTypeFont.loadTTF( pdfFile, new File( "fonts/VREMACCI.TTF" ) );  // Windows Russian font imported to write the Russian text.
font.setEncoding( new WinAnsiEncoding() );  // Define the Encoding used in writing.
// Some code here to open the PDF & define a new page.
contentStream.drawString( "отделом компьютерной" ); // Write the Russian text.

The WinAnsiEncoding source code is: click here
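A quick way to check whether a single-byte encoding can represent the Russian text at all is to ask the JDK's own charsets. This is a minimal sketch, not PDFBox code: the class and method names are illustrative, and windows-1252 is used as a stand-in for WinAnsiEncoding, which is an approximation.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Illustrative helper (not part of PDFBox): lists the characters of a string
// that have no code in a given charset.
final class WinAnsiCheck {

    // Returns the characters of text that the named charset cannot encode.
    static String unencodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder missing = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (!encoder.canEncode(c)) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    public static void main(String[] args) {
        String russian = "отделом компьютерной";
        // Assumption: windows-1252 approximates PDF's WinAnsiEncoding.
        System.out.println("Not in windows-1252: " + unencodable(russian, "windows-1252"));
        // windows-1251 is the Windows Cyrillic code page, shown for comparison.
        System.out.println("Not in windows-1251: " + unencodable(russian, "windows-1251"));
    }
}
```

If every Cyrillic letter shows up in the first list, no choice of font will help while the font is tied to that encoding.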

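--------------------- Edited on November 18, 2009

After some investigation, I am now sure that it is an encoding problem, and that it can be solved by defining my own encoding with the helpful PDFBox class called DictionaryEncoding.

I'm not sure how to use it, but here is what I have tried so far: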

COSDictionary cosDic = new COSDictionary();
cosDic.setString( COSName.getPDFName("Ercyrillic"), "0420 " ); // Russian letter.
font.setEncoding( new DictionaryEncoding( cosDic ) );

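This doesn't work. It seems I'm filling the dictionary the wrong way: when I write a PDF page using it, the page comes out blank.

The DictionaryEncoding source code is: click here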

Answer 1

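The long story is this: to do Unicode output in PDF from a TrueType font, the output must include a pile of detailed and seemingly redundant information. What it comes down to is that inside a TrueType font, the glyphs are stored by glyph ID. These glyph IDs are associated with particular Unicode characters (and IIRC, a single glyph may internally refer to several code points, the way é refers to e plus an acute accent; my memory is hazy). PDF doesn't really have Unicode support, other than to say that there exists a mapping from the UTF16BE values in a string to glyph IDs in a TrueType font, as well as a mapping from UTF16BE values to Unicode (even if it's the identity).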

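A font dictionary of subtype Type0 with:
- a DescendantFonts array containing one entry as described below
- a ToUnicode entry that maps UTF16BE values to Unicode
- an Encoding set to Identity-H

The output from a unit test on my own tools looks like this: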

13 0 obj
<< 
   /BaseFont /DejaVuSansCondensed 
   /DescendantFonts [ 4 0 R  ]   
   /ToUnicode 14 0 R 
   /Type /Font 
   /Subtype /Type0 
   /Encoding /Identity-H 
>> endobj

14 0 obj
<< /Length 346 >> stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def /CMapName /Adobe-Identity-UCS
def /CMapType 2 def 1 begincodespacerange <0000> <FFFF> endcodespacerange 1
beginbfrange <0000> <FFFF> <0000> endbfrange endcmap CMapName currentdict /CMap
defineresource pop end end

endstream % Note that the formatting is wrong for the stream
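To make the "UTF16BE values in a string" part concrete, here is a small sketch of how a Java string could be turned into the hex string that a content stream would carry for a font like the one above. It assumes the setup this answer describes (Identity-H, with the two-byte codes being the UTF-16BE values of the characters); the class and method names are illustrative and not taken from any PDF library.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper: renders a string as a PDF hex string of UTF-16BE code units.
final class Utf16Hex {

    // Returns text as "<....>" where every character becomes four hex digits.
    static String toPdfHexString(String text) {
        StringBuilder hex = new StringBuilder("<");
        // UTF_16BE writes big-endian code units and no byte order mark.
        for (byte b : text.getBytes(StandardCharsets.UTF_16BE)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.append('>').toString();
    }

    public static void main(String[] args) {
        // Tj is the PDF "show text" operator; the operand is the hex string.
        System.out.println(toPdfHexString("отделом") + " Tj");
    }
}
```

For "отделом" this prints `<043E044204340435043B043E043C> Tj`. Each four-digit group is then looked up through the descendant font to find a glyph, and through the ToUnicode CMap to get back to text.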

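A font dictionary of subtype CIDFontType2 with:
- a CIDSystemInfo
- a FontDescriptor
- DW and W
- a CIDToGIDMap that maps from character IDs to glyph IDs

Here is that one from the same test; this is the object in the DescendantFonts array: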

4 0 obj
<< 
   /Subtype /CIDFontType2 
   /Type /Font 
   /BaseFont /DejaVuSansCondensed 
   /CIDSystemInfo 8 0 R 
   /FontDescriptor 9 0 R 
   /DW 1000 
   /W 10 0 R 
   /CIDToGIDMap 11 0 R 
>>

8 0 obj
<< 
   /Registry (Adobe)
   /Ordering (UCS)
   /Supplement 0 
>>
endobj
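The CIDToGIDMap referenced above (object 11 in this output) is a stream whose data is a flat table: the glyph ID for character ID c is stored in bytes 2c and 2c+1, high-order byte first. A minimal sketch of building that table follows. The class name and the glyph ID used in `main` are hypothetical; real glyph IDs would have to come from the font's own cmap table.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative builder for the raw data of a CIDToGIDMap stream.
final class CidToGidMapSketch {

    // Lays out glyph IDs so that the entry for CID c occupies bytes 2c and 2c+1.
    static byte[] build(Map<Integer, Integer> cidToGid) {
        int maxCid = 0;
        for (int cid : cidToGid.keySet()) {
            maxCid = Math.max(maxCid, cid);
        }
        // Unmapped CIDs stay 0, which is the font's .notdef glyph.
        byte[] data = new byte[(maxCid + 1) * 2];
        for (Map.Entry<Integer, Integer> entry : cidToGid.entrySet()) {
            int cid = entry.getKey();
            int gid = entry.getValue();
            data[2 * cid] = (byte) (gid >> 8); // high-order byte first
            data[2 * cid + 1] = (byte) gid;    // low-order byte
        }
        return data;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> map = new TreeMap<>();
        // U+0420 is the Cyrillic capital letter ER; glyph ID 581 is made up.
        map.put(0x0420, 581);
        System.out.println("CIDToGIDMap length: " + build(map).length + " bytes");
    }
}
```

Because the table is indexed directly by CID, it grows to cover the highest character ID used, even when only a handful of characters are mapped.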

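Why am I telling you this? What does it have to do with PDFBox? Just this: Unicode output in PDF is, frankly, a royal pain. Acrobat was developed before there was Unicode, and having CJK encodings without Unicode was painful from the very start (I know, I was working on Acrobat back then). Unicode support was added later, but it really feels like it was bolted on. One would hope that you could just say /Encoding /Unicode, have strings that start with the thorn and y-dieresis characters (the bytes FE FF of the UTF-16 byte order mark), and off you'd go. No such luck. If you don't put in every single detailed thing (and really, Acrobat, embedding a PostScript program to translate to Unicode? WTH?), you get a blank page in Acrobat. I swear, I'm not making this up.

At this point, I write PDF generation tools for a separate company (.NET now, so it won't help you), and I made it a design goal to hide all of that nonsense. All text is Unicode; if you only use character codes that are the same as in WinAnsi, that's what you get under the hood. Use anything else and you get all of this other stuff along with it. I would be surprised if PDFBox does this work for you; it is a significant hassle.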

Answer 2

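The solution is quite simple.

1) You have to find a font that is compatible with the characters you want to display.
2) Download the font's .ttf file locally.
3) Load the font from your application.

For example, if you want to use Greek characters, this is what you have to do: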

content = new PDPageContentStream(document, page);
pdfFont = PDType0Font.load( document, new File( "arialuni.ttf" ) );
content.setFont(pdfFont, fontSize);

Answer 3

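Perhaps a Russian encoding class needs to be written; I suppose it would look like WinAnsiEncoding.
That said, I don't know what to put in it!

Or, if that's not what you have already done, maybe you should encode your source file in UTF-8 and use the default encoding.
I have seen some messages about problems extracting Russian text from existing PDF files (using PDFBox, of course), but I don't know whether that is relevant to output.
You could also write to the PDFBox mailing list.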

Answer 4

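Testing whether this is an encoding problem should be easy to do (just switch to UTF16 encoding).

I assume you have tried the VREMACCI font in an editor or something similar and confirmed that it displays the way you expect?

You may want to try doing the same thing in iText, just to find out whether the problem lies with the PdfBox library itself... If your main goal is to generate PDF files, iText might be a better solution.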

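EDIT - long answer to comment:

OK, sorry for the back and forth on the encoding issue... Your core problem (which you probably already know) is that the encoding of the bytes written to the content stream is different from the encoding used to look up glyphs. Now I'll try to actually be helpful:

I took a look at the DictionaryEncoding class in PdfBox, and it looks quite unintuitive... The 'dictionary' in question is a PDF dictionary. So what you basically need to do is create a PDF dictionary object (I think PdfBox calls it a kind of COSObject) and then add entries to it.

The encoding of a font is defined in PDF as a dictionary (see page 266 of the spec referred to above). The dictionary contains a base encoding name and an optional differences array. Technically, the differences array is not supposed to be used with TrueType fonts (I have seen it used in some cases, but don't use it).

You will then specify the encoding entry for the cmap. This cmap will be the encoding of your font.

My suggestion is to take an existing PDF that does what you want, and then get a dump of the dictionary structure for the font so you can see what it looks like.

This is definitely not for the faint of heart. I can offer some help: if you need a dictionary dump, send me a hyperlink to a sample PDF and I'll run it through some of the algorithms I use in iText development (I am the maintainer of the iText text extraction subsystem).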

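EDIT - 11/17/09

OK, here is the dictionary dump from the russian.pdf file (sub-dictionaries are listed indented, in the order in which they appear in the containing dictionary):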

(/CropBox=[0, 0, 595, 842], /Parent=Dictionary of type: /Pages, /Type=/Page, /Contents=[209 0 R, 210 0 R, 211 0 R, 214 0 R, 215 0 R, 216 0 R, 222 0 R, 223 0 R], /Resources=Dictionary, /MediaBox=[0, 0, 595, 842], /StructParents=0, /Rotate=0)
    Subdictionary /Parent = (/Type=/Pages, /Count=6, /Kids=[195 0 R, 1 0 R, 3 0 R, 5 0 R, 7 0 R, 9 0 R])
    Subdictionary /Resources = (/ExtGState=Dictionary, /ProcSet=[/PDF, /Text], /ColorSpace=Dictionary, /Font=Dictionary, /Properties=Dictionary)
        Subdictionary /ExtGState = (/GS0=Dictionary of type: /ExtGState)
            Subdictionary /GS0 = (/OPM=1, /op=false, /Type=/ExtGState, /SA=false, /OP=false, /SM=0.02)
        Subdictionary /ColorSpace = (/CS0=[/ICCBased, 228 0 R])
        Subdictionary /Font = (/C2_1=Dictionary of type: /Font, /C2_2=Dictionary of type: /Font, /C2_3=Dictionary of type: /Font, /C2_4=Dictionary of type: /Font, /TT2=Dictionary of type: /Font, /TT1=Dictionary of type: /Font, /TT0=Dictionary of type: /Font, /C2_0=Dictionary of type: /Font, /TT3=Dictionary of type: /Font)
            Subdictionary /C2_1 = (/DescendantFonts=[243 0 R], /BaseFont=/LDMIEC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /C2_2 = (/DescendantFonts=[233 0 R], /BaseFont=/LDMIBO+TimesNewRomanPSMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /C2_3 = (/DescendantFonts=[224 0 R], /BaseFont=/LDMIHD+TimesNewRomanPS-ItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /C2_4 = (/DescendantFonts=[229 0 R], /BaseFont=/LDMIDA+Tahoma, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /TT2 = (/LastChar=58, /BaseFont=/LDMIFC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 333], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
                Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=136, /Descent=-216, /FontWeight=700, /FontBBox=[-558, -307, 2000, 1026], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMIFC+TimesNewRomanPS-BoldMT, /Ascent=891, /ItalicAngle=0)
            Subdictionary /TT1 = (/LastChar=187, /BaseFont=/LDMICP+TimesNewRomanPSMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 833, 778, 0, 333, 333, 0, 0, 250, 333, 250, 278, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 278, 278, 0, 564, 0, 444, 0, 722, 667, 667, 722, 611, 556, 0, 722, 333, 389, 0, 611, 889, 722, 722, 556, 0, 667, 556, 611, 0, 722, 944, 0, 722, 0, 333, 0, 333, 0, 500, 0, 444, 500, 444, 500, 444, 333, 500, 500, 278, 0, 500, 278, 778, 500, 500, 500, 0, 333, 389, 278, 500, 500, 722, 0, 500, 444, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
                Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=82, /Descent=-216, /FontWeight=400, /FontBBox=[-568, -307, 2000, 1007], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMICP+TimesNewRomanPSMT, /Ascent=891, /ItalicAngle=0)
            Subdictionary /TT0 = (/LastChar=55, /BaseFont=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 500, 500, 500, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
                Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=116.867004, /Descent=-216, /FontWeight=700, /FontBBox=[-547, -307, 1206, 1032], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=98, /XHeight=468, /FontFamily=Times New Roman, /FontName=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Ascent=891, /ItalicAngle=-15)
            Subdictionary /C2_0 = (/DescendantFonts=[238 0 R], /BaseFont=/LDMHPN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /TT3 = (/LastChar=169, /BaseFont=/LDMIEB+Tahoma, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[313, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 546, 0, 546, 0, 0, 546, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 929], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
                Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=92, /Descent=-206, /FontWeight=400, /FontBBox=[-600, -208, 1338, 1034], /CapHeight=734, /FontFile2=Stream, /FontStretch=/Normal, /Flags=32, /XHeight=546, /FontFamily=Tahoma, /FontName=/LDMIEB+Tahoma, /Ascent=1000, /ItalicAngle=0)
        Subdictionary /Properties = (/MC0=Dictionary of type: /OCMD)
            Subdictionary /MC0 = (/Type=/OCMD, /OCGs=Dictionary of type: /OCG)
                Subdictionary /OCGs = (/Usage=Dictionary, /Type=/OCG, /Name=HeaderFooter)
                    Subdictionary /Usage = (/CreatorInfo=Dictionary, /PageElement=Dictionary)
                        Subdictionary /CreatorInfo = (/Creator=Acrobat PDFMaker 6.0 äëÿ Word)
                        Subdictionary /PageElement = (/SubType=/HF)

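There are a lot of moving parts here. You might want to put together a test document with just 3 or 4 characters in it... there are a lot of Type 0 fonts in use here (in addition to the TT fonts), so it's hard to tell what is involved in your particular problem.

(Are you sure you don't want to at least try iText? ;-) I'm not saying it will work, just that it may be worth a shot.)

For reference, the dictionary dump above was obtained using the com.lowagie.text.pdf.parser.PdfContentReaderTool class.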
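The mismatch described earlier in this answer, between the encoding used to write bytes and the encoding used to interpret them, can be seen in the dump itself. The /Creator value reads "Acrobat PDFMaker 6.0 äëÿ Word", which looks like the Russian word "для" stored as windows-1251 bytes and then displayed as Latin-1. A minimal sketch using only the JDK's charsets reproduces the effect; the class and method names are illustrative and come from neither PDFBox nor iText.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative demo: text written in one encoding and read back in another.
final class EncodingMismatch {

    // Encodes text with the charset named writtenAs, then decodes the bytes as readAs.
    static String misread(String text, String writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(Charset.forName(writtenAs));
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // The same three bytes (E4 EB FF) mean different letters in each code page.
        System.out.println(misread("для", "windows-1251", StandardCharsets.ISO_8859_1));
    }
}
```

The same thing happens to glyph lookup: the bytes in the content stream are only as meaningful as the encoding the font dictionary declares for them.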

Answer 5

Try using this construction:

PDFont font = PDType0Font.load( pdfFile, new File( "fonts/VREMACCI.TTF" ) );  // Windows Russian font imported to write the Russian text.
// Some code here to open the PDF & define a new page.
contentStream.beginText();
contentStream.setFont(font, 12);
contentStream.showText( "отделом компьютерной" ); // Write the Russian text.
contentStream.endText();

Answer 6

Try this:

Phrase leftTitle = new Phrase("САНКТ-ПЕТЕРБУРГ", FontFactory.getFont("Tahoma", "Cp1251", true, 25));

This works at least in the latest (5.0.1) iText.

6 answers

Answer 1  5  2012-11-29 15:06:28
Answer 2  1  2018-07-09 09:16:37
Answer 3  0  2009-11-11 08:57:21
Answer 4  0  2009-11-12 03:29:52
Answer 5  0  accepted  2018-06-04 16:56:09
Answer 6  -1  2010-05-01 21:16:53