How to extract the content of a doc/docx file using the fs API of Node.js

The following code works well for extracting the content of doc/docx files. My intention is to extract only the text, not the images. If the code is fed a document that contains images, it cannot process it and renders an enormous amount of text that is unreadable to a human. Is there any way for the fs module to skip the images and extract only the text?

var fs = require("fs");
fs.readFile("Protractor.docx", 'utf8', function (err,data) {
    if (err) {
      return console.log(err);
    }
    console.log(data);
});

You can use the mammoth library, which has an extractRawText method. It extracts only the text and ignores images and all formatting.

This is an example that extracts the text from a docx file containing images:

const superagent = require('superagent');
const mammoth = require('mammoth');

const url = 'http://www.ojk.ee/sites/default/files/respondus-docx-sample-file_0.docx';

const main = async () => {
  const response = await superagent.get(url)
    .parse(superagent.parse.image)
    .buffer();

  const buffer = response.body;

  const text = (await mammoth.extractRawText({ buffer })).value;
  const lines = text.split('\n');

  console.log(lines);
};

main().catch(error => console.error(error));
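
Since the question reads a local file rather than a URL, here is a minimal sketch of the same idea using mammoth's path input instead of a buffer. The helper names toLines and extractLines are hypothetical, chosen for illustration. As far as I know, mammoth handles .docx only, not the older .doc format, and it separates paragraphs with blank lines, which is why the helper drops empty lines.

```javascript
// Hypothetical helper: split extracted text into trimmed, non-empty lines.
function toLines(text) {
  return text
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0);
}

// Hypothetical helper: read a local .docx file and return its text lines.
async function extractLines(docxPath) {
  // Required here so toLines can be used on its own without mammoth installed.
  const mammoth = require('mammoth');
  // extractRawText accepts { path } for a file on disk, or { buffer } as above.
  const result = await mammoth.extractRawText({ path: docxPath });
  return toLines(result.value);
}
```

To run it against the file from the question, call extractLines("Protractor.docx").then(lines => console.log(lines)).catch(error => console.error(error)); after installing mammoth with npm.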
