
Extract Font Name from PDF

I am using pdf.js to extract text from a PDF, but the font name appears as g_d0_f6 etc. I need the actual font name so that I can use the appropriate table for converting to Unicode. Here is the code, derived from the pdf2svg.js sample:

var fs = require('fs');
var util = require('util');
var path = require('path');
var stream = require('stream');

// HACK few hacks to let PDF.js be loaded not as a module in global space.
require('./domstubs.js').setStubs(global);

var pdfjsLib = require('pdfjs-dist');

var pdfPath = process.argv[2] || '../../web/compressed.tracemonkey-pldi-09.pdf';
var data = new Uint8Array(fs.readFileSync(pdfPath));

var loadingTask = pdfjsLib.getDocument({
  data: data,
  nativeImageDecoderSupport: pdfjsLib.NativeImageDecoding.DISPLAY,
});
loadingTask.promise.then(function(doc) {
  var lastPromise = Promise.resolve(); // will be used to chain promises
  var loadPage = function (pageNum) {
    return doc.getPage(pageNum).then(function (page) {
      return page.getTextContent().then(function (textContent) {
        console.log(textContent);
      });
    });
  };

  for (var i = 1; i <= doc.numPages; i++) {
    lastPromise = lastPromise.then(loadPage.bind(null, i));
  }
  return lastPromise;
}).then(function () {
  console.log('# End of Document');
}, function (err) {
  console.error('Error: ' + err);
});

Sample output:

{ items: 
   [ { str: 'bl fp=k osQ ckjs esa cPpksa ls ckrphr djsa & ;g LowQy esa fdl le; dk n`\'; gS\\ cPps',
       dir: 'ltr',
       width: 396.2250000000001,
       height: 15,
       transform: [Array],
       fontName: 'g_d0_f1' },
     { str: 'D;k dj jgs gSa\\ cPps dkSu&dkSu ls [ksy] [ksy j',
       dir: 'ltr',
       width: 216.1650000000001,
       height: 15,
       transform: [Array],
       fontName: 'g_d0_f1' },
     { str: 'g',
       dir: 'ltr',
       width: 6.42,
       height: 15,
       transform: [Array],
       fontName: 'g_d0_f1' },
     { str: 's gSa\\ fp=k esa fdrus cPps gSa vkSj fdrus',
       dir: 'ltr',
       width: 173.865,
       height: 15,
       transform: [Array],
       fontName: 'g_d0_f1' },
     { str: 'cM+s gSa\\ vkil esa dkSu D;k ckr dj jgk gksxk\\ cPpksa ls fp=k esa lcosQ fy, uke lkspus',
       dir: 'ltr',
       width: 396.54000000000013,
       height: 15,
       transform: [Array],
       fontName: 'g_d0_f1' },
     { str: 'dks dgasaA',
       dir: 'ltr',
       width: 40.74,
       height: 15,
       transform: [Array],
       fontName: 'g_d0_f1' },
     { str: 'csVh cpkvks',
       dir: 'ltr',
       width: 66.725,
       height: 17,
       transform: [Array],
       fontName: 'g_d0_f2' },
     { str: 'csVh i<+kvksA',
       dir: 'ltr',
       width: 66.75899999999999,
       height: 17,
       transform: [Array],
       fontName: 'g_d0_f2' },
     { str: '2018-19',
       dir: 'ltr',
       width: 36.690000000000005,
       height: 10,
       transform: [Array],
       fontName: 'g_d0_f3' } ],
  styles: 
   { g_d0_f1: 
      { fontFamily: 'sans-serif',
        ascent: 0.837,
        descent: -0.216,
        vertical: false },
     g_d0_f2: 
      { fontFamily: 'sans-serif',
        ascent: 0.786,
        descent: -0.181,
        vertical: false },
     g_d0_f3: 
      { fontFamily: 'sans-serif',
        ascent: 0.9052734375,
        descent: -0.2119140625,
        vertical: false } } }

And here is the PDF, which uses embedded fonts: http://ncert.nic.in/textbook/pdf/ahhn101.pdf

Here is a related question, but the commonObjs it suggests is empty: pdf.js get info about embedded fonts

Note: The answer below does not have anything to do with pdf.js; however, it answers the question "Extract Font Name from PDF".

I did not find a solution yet, so I went ahead and grabbed mutool, which has the following command to get the font information per page.

mutool info -F input.pdf 0-2147483647

Then I used the spawn function and ran the output through some regex and pattern matching to return the data.

const extractFontData = async str => {
  const getMatches = str => {
    const regex = /Page (\d+):\nFonts \((\d+)\):/;
    const match = str.match(regex);
    if (match) {
      return { page: match[1], fonts: match[2] };
    }
    return {};
  };

  const singleFont = fontData => {
    const match = fontData.match(/\+([a-zA-Z0-9_-]+[.,]?[a-zA-Z0-9_-]+)/);
    return match && match[1];
  };

  return str
    .split("Page ")
    .map(singlePageData => {
      const { page, fonts } = getMatches(`Page ` + singlePageData);
      if (fonts) {
        const split = singlePageData.split("\n").filter(e => e.length);
        const fontList = split.slice(2).map(singleFont);
        return { page, fonts, fontList };
      }
    })
    .filter(e => e);
};

// Taken and adjusted from: https://stackoverflow.com/a/52611536/6161265
function run(...cmd) {
  return new Promise((resolve, reject) => {
    var { spawn } = require("child_process");
    var command = spawn(...cmd);
    var result = "";
    command.stdout.on("data", function(data) {
      result += data.toString();
    });
    command.on("close", function(code) {
      resolve(result);
    });
    command.on("error", function(err) {
      reject(err);
    });
  });
}

async function wrapper(filePath) {
  const data = await run("mutool", ["info", "-F", filePath, "0-2147483647"]);
  return extractFontData(data);
}

Sample usage:

wrapper("ahhn101.pdf").then(data => console.log(data));

Result: [image]

I think you were on the right track: page.commonObjs is where the actual font name is found. However, page.commonObjs only gets populated when the page's text/operators are accessed, so you'll find it empty if you look before that happens.
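
A minimal sketch of that idea, under some assumptions: `page` is the page object from `doc.getPage()` in the question's code, requesting the operator list is enough to make pdf.js load the page's fonts, and the font object stored in `page.commonObjs` exposes a `name` property. That property is an internal detail rather than a documented, stable API, so it may differ between pdf.js versions. The helper name `resolveFontNames` is made up for illustration.

```javascript
// Sketch: map pdf.js internal font ids (e.g. 'g_d0_f1') to the names of the
// fonts embedded in the PDF. Assumes `page` comes from doc.getPage().
async function resolveFontNames(page) {
  const textContent = await page.getTextContent();

  // Asking for the operator list forces the page's fonts to be loaded,
  // which is what fills page.commonObjs.
  await page.getOperatorList();

  const names = {};
  for (const item of textContent.items) {
    // Skip items without a font id and ids that were already looked up.
    if (!item.fontName || item.fontName in names) continue;
    try {
      const font = page.commonObjs.get(item.fontName);
      // Assumption: the font object carries the embedded name in `name`.
      names[item.fontName] = (font && font.name) || null;
    } catch (err) {
      // commonObjs.get() throws when the object has not been resolved.
      names[item.fontName] = null;
    }
  }
  return names;
}
```

In the question's `loadPage` function this could be called as `resolveFontNames(page).then(console.log)` in place of logging `textContent` directly. The result is an object keyed by the internal ids, which can then be used to pick the conversion table for each text item's `fontName`.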
