
How to extract text from a .doc document using Apache POI?

Original post: 2013-03-23 17:57:40 8 1 java/ ms-word/ apache-poi/ doc

I used the two code snippets below to extract text from a .doc file:

    HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
    Range range = document.getRange();
    int len = range.numParagraphs();
    StringBuilder builder = new StringBuilder();

    for (int i = 0; i < len; i++) {
        builder.append(range.getParagraph(i).text());
    }

and

    HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
    WordExtractor wordExtractor = new WordExtractor(document);
    String[] paragraphs = wordExtractor.getParagraphText();
    StringBuilder builder = new StringBuilder();
    for (String p : paragraphs) {
        builder.append(p);
    }

However, both of them always output some strange characters, e.g.: ?PAGEREF_Toc351848910\\h10?HYPERLINK\\l _Toc351848911 ?CITATIONPla\\l1033[?HYPERLINK\\l"Pla"13] . So I want to know where they come from and how to remove them when extracting text from a .doc file.

Thanks in advance

1 answer

I hope this may give you some insight.

    // Imports assume Apache POI (HWPF) and iText 5; adjust the iText package names for other versions.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import com.itextpdf.text.Document;
    import com.itextpdf.text.Paragraph;
    import com.itextpdf.text.pdf.PdfWriter;

    private static void ConvertDoctoPdf(String src, String outputPdf) throws Exception {
        try {
            Document pdfdoc = new Document();

            HWPFDocument doc = new HWPFDocument(new FileInputStream(src));

            // Wrap the HWPFDocument in a WordExtractor to read its text.
            WordExtractor we = new WordExtractor(doc);

            OutputStream outputFile = new FileOutputStream(new File(outputPdf));

            // Attach a PdfWriter so the PDF document is written to the output file.
            PdfWriter.getInstance(pdfdoc, outputFile);

            pdfdoc.open();

            Paragraph para = new Paragraph();

            // Collect all paragraphs from the Word document.
            String[] paragraphs = we.getParagraphText();

            for (int i = 0; i < paragraphs.length; i++) {
                // Append each paragraph's text to the single PDF paragraph.
                para.add(paragraphs[i]);
                //para.add(new Chunk(Chunk.NEWLINE));
            }
            // Print all paragraphs together.
            System.out.println(para);
            // Add the combined paragraph to the PDF document.
            pdfdoc.add(para);

            pdfdoc.close();
            we.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
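
On where the strange characters come from: strings such as PAGEREF, HYPERLINK and CITATION are Word field instructions. In the binary .doc text stream a field is stored as a begin mark (0x13), the instruction, a separator mark (0x14), the displayed result, and an end mark (0x15); the marks are what show up as "?" when printed. POI's WordExtractor also has a static stripFields(String) helper intended for this, which may be enough on its own. The sketch below is a plain-Java alternative that does not depend on POI; the class and method names are made up for illustration, and it assumes the extracted text still contains the three control characters.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper (not part of Apache POI): drops Word field instructions
// and keeps the visible field results.
final class FieldCodeStripper {
    private static final char FIELD_BEGIN = (char) 0x13;
    private static final char FIELD_SEPARATOR = (char) 0x14;
    private static final char FIELD_END = (char) 0x15;

    private FieldCodeStripper() {
    }

    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while we are still inside its
        // instruction part, FALSE once its separator has been seen.
        Deque<Boolean> openFields = new ArrayDeque<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
            } else if (c == FIELD_SEPARATOR) {
                if (!openFields.isEmpty()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                }
            } else if (c == FIELD_END) {
                if (!openFields.isEmpty()) {
                    openFields.pop();
                }
            } else if (!openFields.contains(Boolean.TRUE)) {
                // Ordinary text, or the result part of a field: keep it.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this, the loop in the question could append FieldCodeStripper.stripFieldCodes(p) instead of p. The stack is there because fields can nest, as in the citation example from the question, where a HYPERLINK field sits inside another field's result.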


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

