
How can I read the footer text using PDF.js?

I'm trying to extract DOIs from scientific papers. Since these are almost always located in the page footer, I'd like to try this strategy before going through the main text.

Here is my current approach, using Mozilla's pdf.js to search the first page of an arbitrary PDF.

var Promise = require('bluebird');
const doiRegex = new RegExp('\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])[[:graph:]])+)\b', 'i');

function pdfgrep(fileObj) {
    return Promise.spawn(function* () {
        var pdf = yield pdfjs.getDocument(fileObj.path);
        console.log(pdf);
        var page = yield pdf.getPage(1);
        var text = yield page.getTextContent();

        for (var s of text.items) {
            var match = s.str.match(doiRegex);
            if (match !== null) {
                return match;
            }
        }

        return null;
    });
}

Here is a PDF on which this method can be tested. Note that the DOI is located in the footer, and can be found using the search tool in any run-of-the-mill PDF viewer. However, pdf.getPage doesn't seem to include any text from the footer.

  1. How can I access the footer text with PDF.js?
  2. Failing that, are there any other tools I could use to do this?

The RegExp was not properly written:

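  • \b is not escaped in the string literal; it should be \\b (inside a JavaScript string, '\b' is a backspace character, not a word boundary)
  • [:graph:] is a POSIX character class, which JavaScript regular expressions do not support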

The following was meant:

var doiRegex = /\b(10[.][0-9]{4,}(?:[.][0-9]+)*\/(?:(?!["&\'<>])[\x21-\x7E])+)\b/i;
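To make the two points above concrete, here is a small sketch comparing a pattern built from an unescaped string with the corrected literal. The sample sentence and the DOI in it are made up for illustration, and the shortened patterns `fromString` and `fromEscapedString` are hypothetical names used only for this demonstration.

```javascript
// In a string literal, '\b' is the backspace character (U+0008),
// so this pattern looks for backspaces instead of word boundaries.
const fromString = new RegExp('\b10[.]1000\b');

// Doubling the backslash passes a real \b (word boundary) to the RegExp.
const fromEscapedString = new RegExp('\\b10[.]1000\\b');

// The corrected pattern from the answer, written as a regex literal.
// [\x21-\x7E] stands in for POSIX [:graph:] (printable ASCII except space).
const doiRegex = /\b(10[.][0-9]{4,}(?:[.][0-9]+)*\/(?:(?!["&\'<>])[\x21-\x7E])+)\b/i;

// Made-up sample text with a made-up DOI.
const sample = 'See doi:10.1000/xyz123 for details.';
const sampleMatch = sample.match(doiRegex);

console.log(fromString.test('10.1000'));
console.log(fromEscapedString.test('10.1000'));
console.log(sampleMatch && sampleMatch[1]);
```

Using a regex literal avoids the double-escaping problem entirely, which is probably why the corrected version is written that way.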

getTextContent() returns text items together with their positions on the page. PDF.js often cannot combine individual characters into text runs, because some PDF generators place each glyph at a separate position. This has improved in newer versions of PDF.js (BTW, which version of PDF.js are you using?). Try to glue the text runs together yourself before matching:

...
var text = yield page.getTextContent();
var str = text.items.map(function (s) {
    return s.str;
}).join('');
var match = str.match(doiRegex);
return match;
...
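Since the text items carry their positions, the gluing step can also be limited to the bottom of the page. The following is a minimal sketch, not a tested recipe: `findDoi` and its `maxY` parameter are hypothetical names, it assumes each item exposes a `transform` array whose sixth element is the vertical position, and it assumes the page origin is at the bottom left, so footer items have small y values. The mock object stands in for the result of `page.getTextContent()`, and the DOI in it is made up.

```javascript
const doiRegex = /\b(10[.][0-9]{4,}(?:[.][0-9]+)*\/(?:(?!["&\'<>])[\x21-\x7E])+)\b/i;

// Join the text runs (optionally only those at or below maxY) and search
// the joined string, so a DOI split across several items can still match.
function findDoi(textContent, maxY) {
    const items = textContent.items.filter(function (item) {
        // Assumption: transform[5] holds the item's vertical position.
        return maxY === undefined || item.transform[5] <= maxY;
    });
    const joined = items.map(function (item) {
        return item.str;
    }).join('');
    const match = joined.match(doiRegex);
    return match ? match[1] : null;
}

// Mock of a getTextContent() result: one body item and a footer DOI
// that the PDF generator split into three separate runs.
const mockTextContent = {
    items: [
        { str: 'Introduction', transform: [1, 0, 0, 1, 50, 700] },
        { str: 'doi:10.10', transform: [1, 0, 0, 1, 50, 30] },
        { str: '00/xyz', transform: [1, 0, 0, 1, 95, 30] },
        { str: '123', transform: [1, 0, 0, 1, 125, 30] }
    ]
};

console.log(findDoi(mockTextContent));       // whole page
console.log(findDoi(mockTextContent, 100));  // footer region only
```

Joining with an empty string can also fuse the DOI with whatever text run follows it, so the captured value may need trimming on real documents.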
