
iTextSharp: Specify encoding when doing GetTextFromPage

I'm using iTextSharp to extract some information from a PDF file. Everything is almost perfect (I'm quite impressed, in fact); I just have an issue with some words.

For example, the PDF contains the following sentence:

Dès la fin de soirée, [...]

When I look at the PDF, I see exactly that, but not when I extract the text with the following code:

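for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    // Use a fresh strategy per page: a reused SimpleTextExtractionStrategy
    // keeps accumulating the text of the pages already processed.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    if (currentPageText.Contains(PAGE_MARKER))
    {
        // Return the first page that contains the marker.
        return currentPageText;
    }
}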

The extracted text is the following:

Dès la fi n de soirée, [...]

It's strange, but the "fi" is in fact a single character, and it is followed by a space.

When I open the same PDF in Foxit Reader or Adobe Acrobat, it looks fine, but if I copy and paste the text, I get the following:

Dès la fi n de soirée, [...] (so the correct characters, but with an extra space)

This is one example, but I have several others.

Any idea how to fix this?

For this to make sense, you need some background in PDF syntax.

In its most rudimentary form, a PDF document contains only the instructions needed to render a document in a viewer. In other words, there is no concept of "text" being rendered, just something like "draw character 'A' at location 150, 877" and so on.

In fact, this is a snippet from a .pdf document (opened with a simple text editor):

[a, -28.7356, p, 27.2652, p, 27.2652, e, -27.2652, a, -28.7356, r, 64.6889, a, -28.7356, n, 27.2652, c, -38.7594, e, 444] TJ

TJ is the "draw text" instruction. The array contains pairs of characters and their kerning info.

Now, for any kind of text extraction to work (both in iText and in the copy/paste functionality of Foxit, Adobe, etc.), you need a bit of guesswork (a heuristic, as it is commonly called).

You need to decide when some characters stick together and form a word, and when two characters are far enough apart that there should be a space between them.
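To make that decision concrete, here is a minimal sketch in plain C# (no iText involved) that walks an array like the one in the snippet above. It relies on one fact from the PDF specification: the numbers in a TJ array are expressed in thousandths of a text-space unit and are subtracted from the current position, so a negative value widens the gap before the next glyph. The class name `TjArraySketch`, the method `Rebuild`, and the `gapThreshold` parameter are hypothetical, and the rule "a widening above the threshold is a word break" is only an illustration of the kind of guesswork involved, not what iText or any particular viewer actually does.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TjArraySketch
{
    // items: strings are glyph runs, doubles are TJ position adjustments
    // (thousandths of a text-space unit; negative values widen the gap).
    // gapThreshold is a hypothetical tuning knob, not a value from any spec.
    public static string Rebuild(IEnumerable<object> items, double gapThreshold)
    {
        var sb = new StringBuilder();
        foreach (object item in items)
        {
            if (item is string s)
            {
                sb.Append(s);
            }
            else if (item is double adjust)
            {
                // Illustrative rule: a large enough widening means "word break".
                if (-adjust > gapThreshold && sb.Length > 0)
                {
                    sb.Append(' ');
                }
            }
        }
        return sb.ToString();
    }
}
```

With the values from the snippet and a threshold of 200, `Rebuild` returns `appearance`. Lower the threshold to 25 and the same array comes back as `a ppe a ra nc e`. That is the same kind of spurious space as in "fi n": a gap that is slightly wider than the heuristic expects.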

In your use case, it seems that the distance between "fi" and "n" is greater than the expected distance for that font.

Sadly, iText will not be able to (easily) help you there, since the input document simply seems to be incorrect. Or rather, most readers/viewers seem to be getting it wrong, so it is likely to simply be an issue in the PDF.

Of course, you can implement your own ITextExtractionStrategy. This interface gives you access to the TextRenderInfo objects that contain the characters and graphic state in the PDF. Most text extraction strategies then check the size of a space in the font being used, and use that as a reference to decide when characters should be concatenated and when they should be separated.
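As a rough sketch of what such a strategy could look like in iTextSharp 5, the class below compares the gap between consecutive text chunks with a fraction of the font's space width. The class name `TunableSpaceStrategy` and the `spaceFraction` parameter are hypothetical. The sketch also assumes horizontal, unrotated text and a simple reading order, so treat it as a starting point for experiments rather than a replacement for the built-in strategies.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf.parser;

// Hypothetical strategy: inserts a space only when the gap between two
// chunks exceeds spaceFraction times the width of a space in the font.
public class TunableSpaceStrategy : ITextExtractionStrategy
{
    private readonly float spaceFraction;
    private readonly StringBuilder result = new StringBuilder();
    private Vector lastEnd;

    public TunableSpaceStrategy(float spaceFraction)
    {
        this.spaceFraction = spaceFraction;
    }

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        string text = renderInfo.GetText();
        if (string.IsNullOrEmpty(text)) return;

        LineSegment baseline = renderInfo.GetBaseline();
        Vector start = baseline.GetStartPoint();
        Vector end = baseline.GetEndPoint();

        if (lastEnd != null)
        {
            // Assumption: a vertical jump of the baseline means a new line.
            if (Math.Abs(start[Vector.I2] - lastEnd[Vector.I2]) > 1f)
            {
                result.Append('\n');
            }
            else
            {
                float gap = start.Subtract(lastEnd).Length;
                bool alreadySpaced = result[result.Length - 1] == ' ' || text[0] == ' ';
                if (!alreadySpaced && gap > renderInfo.GetSingleSpaceWidth() * spaceFraction)
                {
                    result.Append(' ');
                }
            }
        }

        result.Append(text);
        lastEnd = end;
    }

    public string GetResultantText()
    {
        return result.ToString();
    }
}
```

It would be passed in place of the built-in strategy, for example `PdfTextExtractor.GetTextFromPage(pdfReader, page, new TunableSpaceStrategy(0.75f))`. A larger `spaceFraction` may suppress the stray space after "fi", but it can also glue together words that really are separate, so the value has to be tuned against the actual document.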

Lastly, if you want to delve deeper into this issue, you could always attach the input document.

Kind regards, Joris
