PDFJS losing check marks on pdf forms that are converted to text

I have been using an adaptation of the code from these posts:

PDF to Text extractor in nodejs without OS dependencies

pdfjs: get raw text from pdf with correct newline/withespace

to convert pdfs to text:

import pdfjsLib from 'pdfjs-dist/legacy/build/pdf.js';

import {
    TextItem,
    DocumentInitParameters,
} from 'pdfjs-dist/types/src/display/api';

const getPageText = async (pdf: pdfjsLib.PDFDocumentProxy, pageNo: number) => {
    const page = await pdf.getPage(pageNo);
    const tokenizedText = await page.getTextContent();
    var textItems = tokenizedText.items;
    var finalString = '';
    var line = 0;

    // Concatenate the string of the item to the final string
    for (var i = 0; i < textItems.length; i++) {
        if (line != (textItems[i] as TextItem).transform[5]) {
            if (line != 0) {
                finalString += '\r\n';
            }

            line = (textItems[i] as TextItem).transform[5];
        }
        var item = textItems[i];

        finalString += (item as TextItem).str;
    }
    return finalString;
};

export const getPDFText = async (
    data: string,
    password: string | undefined = undefined
) => {
    const initParams: DocumentInitParameters = {
        data: Buffer.from(data, 'base64'),
        //useSystemFonts: true,
        //disableFontFace: false,
        standardFontDataUrl: 'standard_fonts/'
    };

    if (password !== undefined) {
        initParams.password = password;
    }

    const pdf = await pdfjsLib.getDocument(initParams).promise;
    const maxPages = pdf.numPages;
    const pageTextPromises = [];
    for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
        pageTextPromises.push(getPageText(pdf, pageNo));
    }
    const pageTexts = await Promise.all(pageTextPromises);
    const joined = pageTexts.join(' ');
    return joined;
};

With version 3.1.81 of pdfjs-dist this looks pretty good, but check marks on form fields are lost, and text field values show up at the end of each page instead of remaining in context. I suspect that https://pdftotext.com/ uses pdfjs, based on similarities with my output, but they get the checks in the boxes and their text field "answers" appear next to the question.

Run with:

import { join } from 'path';
import { readFileSync } from 'fs';

const rawContents = readFileSync(join('directory', 'file.pdf'), 'base64');

const pdfText = await getPDFText(rawContents as string);

Does anyone have an idea why I am losing the checks (the boxes are there)?

Sample of what I get:

22. when something something?
☐ 0-3 months ago
☐ 4-6 months ago
☐ 7-12 months ago
☐ 13-18 months ago
☐ 19-24 months ago
☐ 25-60 months ago
☐ I don't know

Here is what that webpage gets:

22. when something something?

✔ 0-3 months ago
☐
☐ 4-6 months ago

☐ 7-12 months ago

☐ 13-18 months ago

☐ 19-24 months ago

☐ 25-60 months ago

☐ I don’t know

Again, my output looks like theirs but has lost these checks. I don't know for sure that they use pdfjs, but I think they do.

Note that I have downloaded and put a couple of fonts in the standard_fonts directory. Should I copy them all even if I see no warning message?

In forms, check boxes are field boundaries, not part of any nearby text (this is true of all fields: they are not directly connected to their descriptions). They simply have a name and a value. Here Check Box1 and Box2 are placed and Box3 is awaiting its surface appearance.

NOTE especially that they do not have a fixed appearance: they morph when displayed. They are chimeras that only look like they are present.

In these AcroForm cases the fields have no native plain-text equivalent, so there is nothing to detect: the index simply points to page coordinates.
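
As a rough illustration of the point above, the field data can be read separately from the text. The sketch below assumes annotation objects shaped like the ones pdf.js returns from `page.getAnnotations()` (`subtype`, `fieldType`, `checkBox`, `fieldName`, `fieldValue`, `exportValue`, `rect`); treat those property names as assumptions to verify against your pdfjs-dist version. `summarizeCheckboxes` is a hypothetical helper, not part of pdf.js.

```typescript
// Minimal shape of the annotation data this sketch relies on. This is an
// assumption modelled on pdf.js annotation objects, not an official type.
interface WidgetLike {
    subtype?: string;
    fieldType?: string;
    checkBox?: boolean;
    fieldName?: string;
    fieldValue?: unknown;
    exportValue?: unknown;
    rect?: number[];
}

interface CheckboxState {
    name: string;
    checked: boolean;
    rect: number[]; // [x1, y1, x2, y2] in page coordinates
}

// Hypothetical helper: keep only checkbox widgets and work out their state.
// A box is treated as checked when its current value equals its "on" value.
const summarizeCheckboxes = (annotations: WidgetLike[]): CheckboxState[] =>
    annotations
        .filter(
            (a) =>
                a.subtype === 'Widget' &&
                a.fieldType === 'Btn' &&
                a.checkBox === true
        )
        .map((a) => ({
            name: a.fieldName ?? '',
            checked:
                a.exportValue !== undefined &&
                a.exportValue !== null &&
                a.fieldValue === a.exportValue,
            rect: a.rect ?? [0, 0, 0, 0],
        }));
```

Inside `getPageText` you could pass it the result of `await page.getAnnotations()`; each `rect` then tells you where on the page the box sits, which is the only link back to the surrounding text.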

PDF.js is a PDF-to-HTML converter, so it can easily display those indexed areas as HTML fields. Note that here the mark is an X.

In terms of the PDF's extractable surface there is no text for the marks; for the boxes above and below there is only the description seen alongside those radio boxes.

Note that here the mark is a tick: nothing differs except the displayer (viewer).

If we try to extract text using PDF.js (here in the browser) we get just the text.

In some cases where the native Symbol or ZapfDingbats fonts, or another TTF with those code points, have been embedded and adapted for state, it may be possible to get a font-based checkmark symbol, but that is rare unless the form was designed for it.

In your case you see ☐. Replacing it with ☑ or ☒ means picking the correct glyph from the font and adding it as a replacement; it is not very easy, but it is doable.
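
One way that replacement might be approached is purely by position: for each checked box, find the nearest extracted ☐ and swap it for ☑. The sketch below is only that, a sketch. `markCheckedBoxes`, `PositionedText`, `CheckedBox` and the `maxDistance` tolerance are all hypothetical names and values; it assumes text items carry `str` and a `transform` whose entries 4 and 5 are the x and y position (as the question's code already relies on for `transform[5]`), and that each box has a `rect` of `[x1, y1, x2, y2]` in the same coordinate space.

```typescript
// Hypothetical shapes for illustration only.
interface PositionedText {
    str: string;
    transform: number[]; // transform[4] = x, transform[5] = y
}

interface CheckedBox {
    checked: boolean;
    rect: number[]; // [x1, y1, x2, y2]
}

const EMPTY_BOX = '☐';
const CHECKED_BOX = '☑';

// For every checked box, replace the empty-box glyph in the closest text item
// (within maxDistance page units) with a checked-box glyph.
// Returns copies of the items; the input array is left untouched.
const markCheckedBoxes = (
    items: PositionedText[],
    boxes: CheckedBox[],
    maxDistance = 15 // assumed tolerance, tune for your documents
): PositionedText[] => {
    const result = items.map((item) => ({ ...item }));

    for (const box of boxes) {
        if (!box.checked) continue;

        const [x1 = 0, y1 = 0, x2 = 0, y2 = 0] = box.rect;
        const cx = (x1 + x2) / 2;
        const cy = (y1 + y2) / 2;

        let nearest: PositionedText | undefined;
        let nearestDist = maxDistance;
        for (const item of result) {
            if (!item.str.includes(EMPTY_BOX)) continue;
            const dist = Math.hypot(
                (item.transform[4] ?? 0) - cx,
                (item.transform[5] ?? 0) - cy
            );
            if (dist <= nearestDist) {
                nearest = item;
                nearestDist = dist;
            }
        }

        // Only the first empty-box glyph in the matched item is replaced.
        if (nearest) {
            nearest.str = nearest.str.replace(EMPTY_BOX, CHECKED_BOX);
        }
    }
    return result;
};
```

Matching by distance is fragile when boxes are tightly packed or when the glyph and the field rectangle are offset from each other, so the tolerance would need checking against real forms.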

For anyone else out there looking:

https://formulae.brew.sh/formula/poppler

This includes the pdftotext command, which gets the checkmarks.
