What is this (cid:51) in the output of pdf2txt?

Question

So I'm trying to extract the text from a PDF file, and I need its position, width, height, and font.

I have tried many tools, but the most useful and complete solution looks to be PDFMiner, and in this case, more exactly, pdf2txt.py.

I have followed the docs and the examples and tried to extract the text Learn More from my PDF using this command:

pdf2txt.py -Y normal -t xml -o buttons.xml buttons.pdf

And the output buttons.xml looks like this:

<?xml version="1.0" encoding="utf-8" ?>
  <pages>
      <page id="1" bbox="0.000,0.000,799.900,449.944" rotate="0">
      <textbox id="0" bbox="164.979,213.240,247.680,235.944">
          <textline bbox="164.979,213.240,247.680,235.944">
              <text font="KZNUUP+HelveticaNeue-Bold" bbox="164.979,213.240,178.978,235.944" size="22.704">(cid:51)</text>
              <text font="KZNUUP+HelveticaNeue-Bold" bbox="173.280,213.240,187.278,235.944" size="22.704">(cid:76)</text>
              <text font="KZNUUP+HelveticaNeue-Bold" bbox="181.315,213.240,195.313,235.944" size="22.704">(cid:72)</text>
              <text font="KZNUUP+HelveticaNeue-Bold" bbox="189.350,213.240,203.348,235.944" size="22.704">(cid:89)</text>
              <text font="KZNUUP+HelveticaNeue-Bold" bbox="194.795,213.240,208.793,235.944" size="22.704">(cid:85)</text>
              <text font="KZNUUP+HelveticaNeue-Bold" bbox="203.096,213.240,217.094,235.944" size="22.704">(cid:3)</text>
              <text font="KZNUUP+HelveticaNeue-Bold" bbox="206.987,213.240,220.986,235.944" size="22.704">(cid:52)</text>
              <text font="KZNUUP+HelveticaNeue-Bold" bbox="219.684,213.240,233.682,235.944" size="22.704">(cid:86)</text>
              <text font="KZNUUP+HelveticaNeue-Bold" bbox="228.237,213.240,242.235,235.944" size="22.704">(cid:89)</text>
              <text font="KZNUUP+HelveticaNeue-Bold" bbox="233.682,213.240,247.680,235.944" size="22.704">(cid:76)</text>
              <text></text>
          </textline>
          </textbox>
          <textgroup bbox="164.979,213.240,419.659,235.944">
              <textbox id="0" bbox="164.979,213.240,247.680,235.944" />
          </textgroup>
      </page>
  </pages>
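
For the position, width, height, and font part of the question, the attributes are already present in this XML even when the characters are not decoded. The following is a minimal sketch, assuming the layout shown above (a `bbox` of the form `x0,y0,x1,y1` plus `font` and `size` attributes on each `<text>` element). The helper name `extract_glyphs` and the trimmed `SAMPLE` string are illustrative, not part of PDFMiner.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the output above (first two <text> elements only).
SAMPLE = """<pages>
  <page id="1" bbox="0.000,0.000,799.900,449.944" rotate="0">
    <textbox id="0" bbox="164.979,213.240,247.680,235.944">
      <textline bbox="164.979,213.240,247.680,235.944">
        <text font="KZNUUP+HelveticaNeue-Bold" bbox="164.979,213.240,178.978,235.944" size="22.704">(cid:51)</text>
        <text font="KZNUUP+HelveticaNeue-Bold" bbox="173.280,213.240,187.278,235.944" size="22.704">(cid:76)</text>
        <text></text>
      </textline>
    </textbox>
  </page>
</pages>
"""

CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def extract_glyphs(xml_text):
    """Return one dict per <text> element that carries a bbox (hypothetical helper)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    glyphs = []
    for node in root.iter("text"):
        bbox = node.get("bbox")
        if bbox is None:
            continue  # skip the empty <text></text> element, which has no attributes
        x0, y0, x1, y1 = (float(v) for v in bbox.split(","))
        match = CID_TOKEN.fullmatch(node.text or "")
        glyphs.append({
            "text": node.text,
            "cid": int(match.group(1)) if match else None,
            "font": node.get("font"),
            "size": float(node.get("size")),
            "x": x0,
            "y": y0,
            "width": x1 - x0,
            "height": y1 - y0,
        })
    return glyphs


glyphs = extract_glyphs(SAMPLE)
for glyph in glyphs:
    print(glyph)
```

Each dict keeps the raw `(cid:N)` text next to the parsed number, so the geometry can be used now and the characters decoded later.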

The first character should be an L, and 51 (cid:51) doesn't seem to match any of the characters in my sentence, according to either the ASCII table or the UTF-8 table.

So as the title says, I wonder what it is, and how to use these (cid:51)... ?

EDIT

So I found that instead of getting the real character, the program writes (cid:%d) because it doesn't recognize that it's a Unicode string.

It first calls this function to write the char:

def render_char(self, matrix, font, fontsize, scaling, rise, cid):
    try:
        text = font.to_unichr(cid)
        assert isinstance(text, unicode), text
    except PDFUnicodeNotDefined:
        text = self.handle_undefined_char(font, cid)

But this fails and raises the PDFUnicodeNotDefined exception, which is caught and calls:

def handle_undefined_char(self, font, cid):
    if self.debug:
        print >>sys.stderr, 'undefined: %r, %r' % (font, cid)
    return '(cid:%d)' % cid

And that's how I end up with a file containing all these (cid:%d).
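
The fallback described above can be reproduced in isolation. This is a toy model, not PDFMiner's code: `ToyFont` and `UnicodeNotDefined` are hypothetical stand-ins, and it assumes that the font's `to_unichr` raises the exception whenever it has no Unicode mapping for a CID. The entry `51 -> "L"` is taken from the text I expect, not read from the font.

```python
class UnicodeNotDefined(Exception):
    """Hypothetical stand-in for PDFMiner's PDFUnicodeNotDefined."""


class ToyFont:
    """Hypothetical font that only knows the CIDs listed in its table."""

    def __init__(self, cid_to_unicode):
        self.cid_to_unicode = cid_to_unicode

    def to_unichr(self, cid):
        try:
            return self.cid_to_unicode[cid]
        except KeyError:
            # No mapping for this CID: signal it the way the excerpt above expects.
            raise UnicodeNotDefined(cid)


def render_char(font, cid):
    """Return the decoded character, or the '(cid:N)' placeholder on failure."""
    try:
        return font.to_unichr(cid)
    except UnicodeNotDefined:
        return "(cid:%d)" % cid


font_without_table = ToyFont({})         # font with no usable Unicode mapping
font_with_table = ToyFont({51: "L"})     # assumed mapping for illustration

placeholder = render_char(font_without_table, 51)
decoded = render_char(font_with_table, 51)
print(placeholder, decoded)
```

So the placeholder only means that no CID-to-Unicode mapping was available for that font at extraction time.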

I'm fairly new to Python and I'm trying to figure out a way to recognize these chars. There should be one, no? Does anyone have any idea?

Answer 1

To understand how to interpret the CID you need to know a couple of things:

The Registry-Ordering-Supplement (ROS) information for the font in question. It's usually something like 'Adobe-Japan1-5' and is an informational property stored in the font. The ROS determines how the CIDs are to be interpreted.
Armed with the ROS info, select a compatible CMap and decode through that. You can find CMap files for the Adobe-defined ROSes at http://sourceforge.net/projects/cmap.adobe/files/
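
Once a CID-to-character table is available, decoding the placeholders is a plain lookup. The sketch below assumes such a table rather than loading a real CMap: `CID_TO_CHAR` is built by hand by pairing the CIDs from the question's output with the text the asker expected ("Learn More"), so it is valid at most for that one font subset. The helper name `decode_cids` is hypothetical.

```python
import re

# CIDs from the question's XML output, in order, and the text expected there.
CIDS = [51, 76, 72, 89, 85, 3, 52, 86, 89, 76]
EXPECTED = "Learn More"

# Hand-built table (an assumption): a real one would come from the font's CMap.
CID_TO_CHAR = dict(zip(CIDS, EXPECTED))

CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def decode_cids(text, table):
    """Replace each '(cid:N)' token with table[N]; leave unknown CIDs untouched."""
    def replace(match):
        cid = int(match.group(1))
        return table.get(cid, match.group(0))
    return CID_TOKEN.sub(replace, text)


raw = "".join("(cid:%d)" % cid for cid in CIDS)
decoded = decode_cids(raw, CID_TO_CHAR)
print(decoded)
```

Unknown CIDs are left as placeholders on purpose, so gaps in the table stay visible instead of being silently dropped.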

More information on CIDs and CMaps, direct from the inventors, is available at http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf

See the question "decode CID font codes to equivalent ASCII characters" for more information.

Solution 1
0 2018-03-13 20:32:28
