Apache POI HWPF - problem converting a doc file to PDF

I am currently working on a Java project that uses Apache POI, and I want to convert a .doc file to a PDF file. The conversion succeeds, but the PDF contains only the text, without any text style or text colour. My PDF looks black and white, while my .doc file is coloured and uses several different text styles.

This is my code:

 POIFSFileSystem fs = null;  
 Document document = new Document(); 

 try {  
     System.out.println("Starting the test");  
     fs = new POIFSFileSystem(new FileInputStream("/document/test2.doc"));  

     HWPFDocument doc = new HWPFDocument(fs);  
     WordExtractor we = new WordExtractor(doc);  

     OutputStream file = new FileOutputStream(new File("/document/test.pdf")); 

     PdfWriter writer = PdfWriter.getInstance(document, file);  

     Range range = doc.getRange();
     document.open();  
     writer.setPageEmpty(true);  
     document.newPage();  
     writer.setPageEmpty(true);  

     String[] paragraphs = we.getParagraphText();  
     for (int i = 0; i < paragraphs.length; i++) {  

         org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
         // CharacterRun run = pr.getCharacterRun(i);
         // run.setBold(true);
         // run.setCapitalized(true);
         // run.setItalic(true);
         paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
         System.out.println("Length:" + paragraphs[i].length());
         System.out.println("Paragraph" + i + ": " + paragraphs[i].toString());

         // add the paragraph to the document
         document.add(new Paragraph(paragraphs[i]));
     }

     System.out.println("Document testing completed");  
 } catch (Exception e) {  
     System.out.println("Exception during test");  
     e.printStackTrace();  
 } finally {  
     // close the document
     document.close();
 }
 }  

Please help me.

Thanks in advance.

If you look at Apache Tika, there's a good example of reading some style information from an HWPF document. The code in Tika generates HTML based on the HWPF contents, but you should find that something very similar works for your case.

The Tika class is https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java

One thing to note about Word documents is that everything in any one Character Run has the same formatting applied to it. A Paragraph is therefore made up of one or more Character Runs. Some styling is applied to the Paragraph, and other styling is applied to the runs. Depending on which formatting interests you, it may therefore be found on the paragraph or on the run.
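To make the paragraph/run relationship concrete, here is a minimal, library-free sketch. The `StyledRun` and `StyledParagraph` classes are hypothetical stand-ins, not POI types. They only model the idea that formatting is constant within a run and that a paragraph's plain text is the concatenation of its runs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a character run: a stretch of text with uniform formatting.
final class StyledRun {
    final String text;
    final boolean bold;
    final boolean italic;

    StyledRun(String text, boolean bold, boolean italic) {
        this.text = text;
        this.bold = bold;
        this.italic = italic;
    }
}

// Hypothetical stand-in for a paragraph: an ordered list of runs.
final class StyledParagraph {
    private final List<StyledRun> runs = new ArrayList<>();

    StyledParagraph add(String text, boolean bold, boolean italic) {
        runs.add(new StyledRun(text, bold, italic));
        return this;
    }

    int numRuns() {
        return runs.size();
    }

    StyledRun run(int index) {
        return runs.get(index);
    }

    // Joining the run texts drops the formatting, which is all a text-only extractor sees.
    String plainText() {
        StringBuilder sb = new StringBuilder();
        for (StyledRun r : runs) {
            sb.append(r.text);
        }
        return sb.toString();
    }
}

final class RunModelDemo {
    public static void main(String[] args) {
        StyledParagraph p = new StyledParagraph()
                .add("Hello ", false, false)
                .add("bold", true, false)
                .add(" world", false, true);
        System.out.println(p.plainText());
        System.out.println(p.numRuns());
    }
}
```

Working from the joined text alone loses the per-run flags, which matches the symptom in the question: the PDF is built from paragraph strings, so no style information ever reaches it.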

If you use WordExtractor, you will get only the text. Try using the CharacterRun class instead, which gives you the style along with the text. Please refer to the following sample code.

// Needs org.apache.poi.hwpf.usermodel.Range and CharacterRun; doc is the HWPFDocument from the question.
Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++) {
    org.apache.poi.hwpf.usermodel.Paragraph poiPara = range.getParagraph(i);
    // Visit every character run in the paragraph; formatting is uniform within a run.
    for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
        CharacterRun run = poiPara.getCharacterRun(j);
        System.out.println("Color " + run.getColor());
        System.out.println("Font size " + run.getFontSize());
        System.out.println("Font Name " + run.getFontName());
        System.out.println(run.isBold() + " " + run.isItalic() + " " + run.getUnderlineCode());
        System.out.println("Text is " + run.text());
    }
}
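
Once the run properties are available, they still have to be translated into whatever the PDF library expects. The sketch below is library-free and only shows the conversions. `RunStyleMapper`, its flag constants, and the partial colour table are assumptions made for illustration. It assumes that `getFontSize()` reports half-points and that `getColor()` returns the Word colour index (1 = black, 2 = blue, up to 8 = white), so check both against the POI version you use.

```java
// Hypothetical helper: converts raw run properties into PDF-friendly values.
final class RunStyleMapper {
    // Illustrative style flags; map these onto your PDF library's own constants.
    static final int BOLD = 1;
    static final int ITALIC = 2;
    static final int UNDERLINE = 4;

    // Partial table from Word colour index to 0xRRGGBB (assumed values).
    private static final int[] PALETTE = {
        0x000000, // 0: automatic, treated as black here
        0x000000, // 1: black
        0x0000FF, // 2: blue
        0x00FFFF, // 3: cyan
        0x00FF00, // 4: green
        0xFF00FF, // 5: magenta
        0xFF0000, // 6: red
        0xFFFF00, // 7: yellow
        0xFFFFFF  // 8: white
    };

    // Font size is assumed to arrive in half-points, so 24 means 12 pt.
    static float toPoints(int halfPoints) {
        return halfPoints / 2.0f;
    }

    // Combine the boolean properties into one bitmask; underline code 0 is taken as "none".
    static int styleFlags(boolean bold, boolean italic, int underlineCode) {
        int flags = 0;
        if (bold) flags |= BOLD;
        if (italic) flags |= ITALIC;
        if (underlineCode != 0) flags |= UNDERLINE;
        return flags;
    }

    // Indexes outside the partial table fall back to black.
    static int toRgb(int colorIndex) {
        if (colorIndex < 0 || colorIndex >= PALETTE.length) {
            return 0x000000;
        }
        return PALETTE[colorIndex];
    }
}
```

With a run in hand, the calls would look like `RunStyleMapper.toPoints(run.getFontSize())`, `RunStyleMapper.styleFlags(run.isBold(), run.isItalic(), run.getUnderlineCode())` and `RunStyleMapper.toRgb(run.getColor())`. The results can then be used to build a font for each run, so that every run is added to the PDF paragraph as its own styled piece of text rather than adding the whole paragraph string at once.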
