Regex expression for searching spaced/broken words in OCR PDFs (goo d ni g ht)

Original post 2014-04-24 23:42:18 7 1 regex/ pdf/ ocr/ space

I need to search lots of OCR PDFs. I realized the words and sentences look perfect visually, but if I copy and paste the content, there are spaces which shouldn't be there!

I can see in the text: good night

If I copy and paste it somewhere: goo d ni g ht

I would appreciate advice on handling this situation with a regex, considering:

a) The simple case of short words, such as \\bgood night\\b for goo d ni g ht

b) When there is a line break in the sentence. I mean, the regex isn't able to search from one line to the next in the PDF, even when both lines belong to the same paragraph. I am looking for \\bthe sun set and the night comes\\b , but the PDF content looks like this when pasted:

line 1: t he sun set an d th e

line 2: nig ht co m es

Many thanks, Cadu

1 solution

This random occurrence of spaces in the middle of words can happen in PDFs. The reason behind it is the complex format that PDF actually is. You see, a PDF document is actually a container of instructions for rendering the text in a viewer.

Imagine instructions like:

go to position 50, 50.
draw the character 'G'
go to position 56, 50.
draw the character 'O'
etc.

Whenever you select something in a viewer (for instance Adobe), the program has to figure out what content overlaps with your selection (this alone is not an easy problem). If it's text, it then needs to decide where to add spaces and line breaks. Different viewers (or software) might use different metrics for this. A typical one, for instance, is "insert a space if two characters are further apart than the width of the space character in the same font".
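To make that heuristic concrete, here is a minimal Python sketch. It is not how any particular viewer works; the `Glyph` record, the `join_glyphs` function and the numbers are all made up for illustration, and a single line of text is assumed.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned character on a line."""
    char: str
    x: float      # left edge of the glyph
    width: float  # horizontal space the glyph occupies


def join_glyphs(glyphs, space_width):
    """Rebuild a string, inserting a space wherever the gap between two
    neighbouring glyphs is larger than space_width."""
    pieces = []
    previous = None
    for glyph in sorted(glyphs, key=lambda g: g.x):
        if previous is not None:
            gap = glyph.x - (previous.x + previous.width)
            if gap > space_width:
                pieces.append(" ")
        pieces.append(glyph.char)
        previous = glyph
    return "".join(pieces)


# The 'd' sits 3 units after the second 'o' ends (made-up coordinates).
word = [Glyph("g", 50, 6), Glyph("o", 56, 6), Glyph("o", 62, 6), Glyph("d", 71, 6)]
print(join_glyphs(word, space_width=2.5))  # goo d
print(join_glyphs(word, space_width=4.0))  # good
```

The same glyph positions give "goo d" or "good" depending only on the threshold, which is why two programs can extract different text from the same page.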

The point is, getting text out of a PDF document always involves some guesswork. And if you add the fact that it's an OCR PDF, you are adding a further layer of difficulty.
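Given that guesswork, one possible workaround on the search side is to make the pattern tolerant of stray whitespace instead of trying to repair the text. The Python sketch below assumes you already have the extracted text as a string; `spaced_pattern` is a hypothetical helper, not a library function. It drops the spaces from the phrase and allows any run of whitespace, including line breaks, between every pair of characters.

```python
import re


def spaced_pattern(phrase):
    """Compile a regex matching phrase even when arbitrary whitespace
    (spaces, tabs, newlines) appears between any two of its characters."""
    # Keep only the visible characters; the phrase's own spaces become optional too.
    chars = [re.escape(c) for c in phrase if not c.isspace()]
    # \s* between characters also matches "\n", so a match can cross lines.
    # The \b anchors assume the phrase starts and ends with a word character.
    return re.compile(r"\b" + r"\s*".join(chars) + r"\b", re.IGNORECASE)


pasted = "t he sun set an d th e\nnig ht co m es"
print(bool(spaced_pattern("the sun set and the night comes").search(pasted)))  # True
print(spaced_pattern("good night").search("say goo d ni g ht now").group(0))   # goo d ni g ht
```

The trade-off is false positives: because every space is optional, the pattern for "good night" also matches "goodnight" and "go od night", so hits may still need to be checked by eye.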