
How to read old word doc file metadata

Suppose I want to import a Word file with the .doc extension into my HTML document, along with its metadata, and display it in a div. That means everything in the .doc file: text in varied formats (bold, italics, different sizes, letter spacing, line height, overline, underline, ...), images (both their positions and sizes), graphs and charts (the JSP will generate the graphics needed to render a similar-looking graph or chart; it needs only the data), lists, etc.

So is there any way to do this? Is there any standardized Word API which will give us this data? Or any JSP library that can do it? If not, then what do I need to know and do to get this?

Take a look at the Apache POI project: http://poi.apache.org/text-extraction.html and at Apache Tika: http://tika.apache.org/

And 5 years later, the answer:

NOTE: this code works only for old Word .doc files (not .docx). Apache POI can also handle .docx, but you must use a different API for that (XWPF, from the poi-ooxml artifact).
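Since the code below only applies to the old binary format, it can help to check what kind of file you actually have before picking an API. The following is a minimal sketch using only the JDK, not part of Apache POI. The class name `WordFormatSniffer` and the returned labels are made up for illustration. It relies on the fact that legacy .doc files are OLE2 compound files, while .docx files are ZIP packages. Other formats share these signatures (.xls and .ppt are OLE2 too, and any ZIP archive matches the second check), so treat the result as a hint rather than proof.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not a POI class): guesses the container format from the leading bytes.
public class WordFormatSniffer {
    // OLE2 compound file signature, used by legacy .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature, used by .docx packages.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Returns "ole2", "zip", or "unknown" depending on how the data starts. */
    public static String sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "ole2";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "zip";
        }
        return "unknown";
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    public static String sniff(Path file) throws IOException {
        byte[] header = new byte[8];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // A single read() may return fewer bytes than requested, so loop.
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
        }
        return sniff(Arrays.copyOf(header, read));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: WordFormatSniffer <file>");
            return;
        }
        System.out.println(sniff(Paths.get(args[0])));
    }
}
```

A result of "ole2" suggests the HPSF/HWPF code in this answer is the right path; "zip" suggests the file is a .docx and needs the other API.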

Using Apache POI, Maven dependency:

<!-- https://mvnrepository.com/artifact/org.apache.poi/poi -->
<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi</artifactId>
  <version>3.17</version>
</dependency>

And here is the code:

  ...
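  import java.io.FileInputStream;
  import java.io.IOException;

  import org.apache.poi.hpsf.MarkUnsupportedException;
  import org.apache.poi.hpsf.NoPropertySetStreamException;
  import org.apache.poi.hpsf.PropertySet;
  import org.apache.poi.hpsf.SummaryInformation;
  import org.apache.poi.hpsf.UnexpectedPropertySetTypeException;
  import org.apache.poi.poifs.filesystem.DirectoryEntry;
  import org.apache.poi.poifs.filesystem.DocumentEntry;
  import org.apache.poi.poifs.filesystem.DocumentInputStream;
  import org.apache.poi.poifs.filesystem.POIFSFileSystem;

  // The class name is illustrative; the original snippet did not show its enclosing class.
  public class ReadDocMetadata {
    public static void main(final String[] args) throws IOException, NoPropertySetStreamException,
        MarkUnsupportedException, UnexpectedPropertySetTypeException {
      // A .doc file is an OLE2 container; open it as a POI filesystem.
      try (final FileInputStream fs = new FileInputStream("src/test/word_template.doc");
          final POIFSFileSystem poifs = new POIFSFileSystem(fs)) {
        final DirectoryEntry dir = poifs.getRoot();
        // The summary information lives in its own stream inside the container.
        final DocumentEntry siEntry = (DocumentEntry) dir.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
        try (final DocumentInputStream dis = new DocumentInputStream(siEntry)) {
          final PropertySet ps = new PropertySet(dis);
          final SummaryInformation si = new SummaryInformation(ps);
          // Read word doc (not docx) metadata.
          System.out.println(si.getLastAuthor());
          System.out.println(si.getAuthor());
          System.out.println(si.getKeywords());
          System.out.println(si.getSubject());
          // ...
        }
      }
    }
  }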

To read the text content you will need an additional dependency:

<dependency>
  <!-- Required for HWPFDocument -->
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-scratchpad</artifactId>
  <version>3.17</version>
</dependency>

Code:

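// Requires: import org.apache.poi.hwpf.HWPFDocument;
// fs is an InputStream opened on the .doc file, as in the metadata example above.
try (final HWPFDocument doc = new HWPFDocument(fs)) {
  // getText() returns the raw document text without formatting.
  return doc.getText().toString();
}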
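The question asks about showing the content in a div. As a rough sketch of the last step only, the plain text extracted above could be escaped and wrapped like this. The class `TextToHtmlDiv`, its method names, and the `doc-content` CSS class are all hypothetical, and the code uses only the JDK. It assumes paragraphs are separated by CR and/or LF characters. It does not preserve bold, italics, images, or any other formatting; that would require walking the document's character runs instead of using the flat text.

```java
// Hypothetical helper (not part of POI): turns extracted plain text into an HTML div.
public class TextToHtmlDiv {

    /** Escapes the characters that are unsafe in HTML text and attribute values. */
    public static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Wraps each non-blank paragraph in a p element inside a single div. */
    public static String toDiv(String text) {
        StringBuilder html = new StringBuilder("<div class=\"doc-content\">");
        // Assumption: paragraphs are delimited by CR, LF, or CRLF.
        for (String paragraph : text.split("\\r\\n|\\r|\\n")) {
            if (paragraph.trim().isEmpty()) {
                continue; // skip empty paragraphs
            }
            html.append("<p>").append(escapeHtml(paragraph)).append("</p>");
        }
        return html.append("</div>").toString();
    }

    public static void main(String[] args) {
        // In real use, the argument would be the string returned by doc.getText().toString().
        System.out.println(toDiv("First paragraph\rSecond <paragraph> & more"));
    }
}
```

Escaping matters here because document text may contain characters such as `<` or `&` that would otherwise be interpreted as markup by the browser.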
