簡體   English   中英

如何使用POI api讀取java中的doc和docx文件

[英]How to read doc and docx file in java with POI api

我正在嘗試閱讀doc和docx文件。 這是代碼:

  static String distination="E:\\         
  static String docFileName="Requirements.docx";
 public static void main(String[] args) throws FileNotFoundException, IOException {
    // TODO code application logic here
    ReadFile rf= new ReadFile();
    rf.ReadFileParagraph(distination+docFileName);


  }
  public void ReadFileParagraph(String path) throws FileNotFoundException, IOException
    {
        FileInputStream fis;
        File file = new File(path);
        fis=new FileInputStream(file.getAbsolutePath());
           String filename=file.getName();

        String fileExtension=fileExtension(path);
        if(fileExtension.equals("doc"))
        {
             HWPFDocument document=new HWPFDocument(fis);
             WordExtractor DocExtractor = new WordExtractor(document);
             ReadDocFile(DocExtractor,filename);

        }
        else if(fileExtension.equals("docx"))
        {

            XWPFDocument documentX = new XWPFDocument(fis);            
            List<XWPFParagraph> pera =documentX.getParagraphs();
            ReadDocXFile(pera,filename);
        }
        else
        {
            System.out.println("format does not match");
        }

    }
    public void ReadDocFile(WordExtractor extractor,String filename)
    {

        for (String paragraph : extractor.getParagraphText()) {
            System.out.println("Peragraph: "+paragraph);
        }
    }
    public void ReadDocXFile(List<XWPFParagraph> extractor,String filename)
    {

        for (XWPFParagraph paragraph : extractor) {
          System.out.println("Question: "+paragraph.getParagraphText());
        }

    }
    public String fileExtension(String filename)
    {

       String extension = filename.substring(filename.lastIndexOf(".") + 1, filename.length());
       return extension;
    }

當我想讀取docx文件時,此代碼會出現異常:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/xmlbeans/XmlException
    at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:52)
    at autometictagdetection.TagDetection.main(TagDetection.java:36)
Caused by: java.lang.ClassNotFoundException: org.apache.xmlbeans.XmlException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 2 more
Java Result: 1

另一個問題是,當我想要讀取一個Doc文件時,它會很好地讀取一些文件,但對於某些文件,它會給出一個例外

    Exception in thread "main" org.apache.poi.hwpf.OldWordFileFormatException: The               document is too old - Word 95 or older. Try HWPFOldDocument instead?
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:222)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:174)
    at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:44)
    at autometictagdetection.TagDetection.main(TagDetection.java:36)
Java Result: 1

我在http://poi.apache.org/hwpf/index.html中看到POI API支持單詞6和單詞95。 請有人能解決這兩個問題嗎?

重新提出你的第一個問題,我想你需要參考項目中的依賴性。

就是我猜:

poi-ooxml-schemas xmlbeans,位於poi-ooxml-schemas-version-yyyymmdd.jar中

(來自Apache POI頁面 )。

是Apache XMLBeans頁面。

我無法列出你需要的每個庫,但你可以通過maven找出...

核心maven依賴關系需要這是問題1的解決方案

<dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.15</version>
        </dependency>
        <!-- For .DOCX FILES -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.15</version>
        </dependency>
       <!-- For .DOC FILES -->
        <dependency>
           <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>3.9</version>
        </dependency>

對於問題2從原始源代碼,似乎POI不支持文件太舊

  /**
   * This constructor loads a Word document from a specific point
   *  in a POIFSFileSystem, probably not the default.
   * Used typically to open embeded documents.
   *
   * @param directory The DirectoryNode that contains the Word document.
   * @throws IOException If there is an unexpected IOException from the passed
   *         in POIFSFileSystem.
   */
  public HWPFDocument(DirectoryNode directory) throws IOException
  {
    // Load the main stream and FIB
    // Also handles HPSF bits
    super(directory);

    // Is this document too old for us?
    if(_fib.getFibBase().getNFib() < 106) {
        throw new OldWordFileFormatException("The document is too old - Word 95 or older. Try HWPFOldDocument instead?");
    }

HWPFDocument的源代碼

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM