How can I read a string from a PDF file in C?

I want to write a program that computes the edit distance between two files. My code works with strings read from a .txt file, but now I want to read strings from PDF, DOC, etc. How can I read strings from these files? I tried the fread function, but it does not work. This is the code I wrote:

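#include <stdio.h>

void method(void) {
    FILE *file;
    char str[19];  /* room for 18 bytes plus the terminating NUL */
    if ((file = fopen("C:/Users/latin/Desktop/prova.pdf", "rb")) == NULL) {
        printf("Error!\n");
        return;  /* do not read from a NULL stream */
    }
    /* Read into the buffer itself, not into the address of a pointer. */
    size_t n = fread(str, 1, sizeof str - 1, file);
    str[n] = '\0';
    printf("%s", str);  /* shows the raw file bytes, not the page text */
    fclose(file);
}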

prova.pdf is a PDF file that contains this string: ciaoCiao merendina.

It is possible to do this in plain C. Adobe did it. Artifex did it. Others have done it. But, as noted in the comments, it is a ton of work. I can outline the steps to give you a feel for what's involved.

First, you could read the "magic number" at the start of the file and check that it is actually a PDF. It should start with %PDF- followed by a version number. Apparently, though, many PDF producers don't conform to this requirement.
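As a rough illustration of that check, here is a minimal sketch that looks for the %PDF- marker in a buffer already read from the file. The function name find_pdf_header and its window parameter are made up for this example; the window is there only because, as said above, not every producer puts the header at byte 0, so a caller might choose to search a little further in (passing 5 demands the marker at the very start).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the offset of "%PDF-" within the first `window` bytes of buf,
   or -1 if the marker is not found there.
   Hypothetical helper for illustration, not part of any library. */
long find_pdf_header(const unsigned char *buf, size_t len, size_t window)
{
    static const char magic[] = "%PDF-";
    const size_t mlen = sizeof magic - 1;

    assert(buf != NULL);
    if (len > window)
        len = window;               /* only inspect the leading window */
    for (size_t i = 0; i + mlen <= len; i++) {
        if (memcmp(buf + i, magic, mlen) == 0)
            return (long)i;
    }
    return -1;
}
```

In a real program you would fread the first kilobyte or so of the file into buf and treat a return value of -1 as "probably not a PDF".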

Next, you need to skip to the very end of the file and read backwards, looking for something like:

startxref
1581
%%EOF

That number is the byte offset of the start of the cross-reference (xref) table, which lists the byte offsets of all the "objects" in the file. An object can be a Page, a Font, a Content Stream, or many other things.
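The backwards search can be sketched as follows, assuming the tail of the file has already been read into memory. The name find_startxref is invented for this example, and the sketch deliberately ignores complications such as trailing garbage after %%EOF.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Scans backwards through a buffer holding the end of a PDF file and returns
   the number that follows the last "startxref" keyword, or -1 on failure.
   Hypothetical helper for illustration. */
long find_startxref(const char *buf, size_t len)
{
    static const char key[] = "startxref";
    const size_t klen = sizeof key - 1;

    assert(buf != NULL);
    if (len < klen)
        return -1;
    for (size_t i = len - klen + 1; i-- > 0; ) {
        if (memcmp(buf + i, key, klen) != 0)
            continue;
        /* Copy the digits into a NUL-terminated scratch buffer for strtol. */
        char num[32];
        size_t n = 0, p = i + klen;
        while (p < len && (buf[p] == ' ' || buf[p] == '\r' || buf[p] == '\n'))
            p++;
        while (p < len && n < sizeof num - 1 && buf[p] >= '0' && buf[p] <= '9')
            num[n++] = buf[p++];
        if (n == 0)
            return -1;              /* keyword found but no offset after it */
        num[n] = '\0';
        return strtol(num, NULL, 10);
    }
    return -1;
}
```

A caller would typically fseek to a position near the end of the file (how many bytes to read back is a choice, not something this sketch decides), fread the remainder, pass it to this function, and then fseek to the returned offset to reach the table.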

Looking at the cross-reference table, you'll see something like this:

xref
0 5
0000000000 65535 f 
0000000010 00000 n 
0000000063 00000 n 
0000000127 00000 n 
0000000234 00000 n 
trailer
<<
  /Root 1 0 R
  /Size 5
>>
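Each entry line in the table holds a 10-digit byte offset, a 5-digit generation number, and a flag: n for an object that is in use, f for a free entry. The object number is not stored in the entry; it follows from the entry's position, starting at the first number on the line after xref. A minimal sketch of reading one entry, with parse_xref_entry being a name invented for this example:

```c
#include <assert.h>
#include <stdio.h>

/* Parses one cross-reference entry such as "0000000063 00000 n ".
   Returns 1 and fills the outputs on success, 0 otherwise.
   Hypothetical helper for illustration. */
int parse_xref_entry(const char *line, long *offset, int *gen, char *type)
{
    assert(line != NULL && offset != NULL && gen != NULL && type != NULL);
    /* 10-digit byte offset, 5-digit generation number, then 'n' or 'f'. */
    if (sscanf(line, "%10ld %5d %c", offset, gen, type) != 3)
        return 0;
    return *type == 'n' || *type == 'f';
}
```

For the third entry above this yields offset 63, generation 0, and type n, meaning object 2 starts 63 bytes into the file.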

The line /Root 1 0 R tells you which object is the root of the document tree. You'll need to examine this object to find the top-level Pages object, which looks like this:

2 0 obj
<< /Kids [ 3 0 R ] 
/Type /Pages 
/Count 1 
>> 
endobj

The Kids element here contains a reference to the first Page object, which looks like this:

3 0 obj
<< /Contents [ 4 0 R ] 
/MediaBox [ 0.0 0.0 612.0 792.0 ] 
/Type /Page 
/Parent 2 0 R 
>> 
endobj
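Every hop so far (/Root 1 0 R, /Kids [ 3 0 R ], /Contents [ 4 0 R ]) is an indirect reference of the form "object-number generation-number R". A naive sketch of pulling the first such reference out of a dictionary's text is shown here. The name find_ref is invented for the example, and a plain strstr search is an assumption that only holds for simple dictionaries like the ones above; a real parser has to tokenize the dictionary properly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Looks for `key` in the text of a dictionary and parses the first
   "N G R" reference that follows it, optionally inside an array.
   Returns 1 on success, 0 otherwise. The outputs are only meaningful
   when 1 is returned. Hypothetical helper for illustration. */
int find_ref(const char *dict, const char *key, int *obj, int *gen)
{
    char r = '\0';

    assert(dict != NULL && key != NULL && obj != NULL && gen != NULL);
    const char *p = strstr(dict, key);
    if (p == NULL)
        return 0;
    p += strlen(key);
    /* Skip whitespace and the optional opening bracket of an array. */
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n' || *p == '[')
        p++;
    if (sscanf(p, "%d %d %c", obj, gen, &r) != 3)
        return 0;
    return r == 'R';
}
```

Calling find_ref on the Page dictionary with the key "/Contents" gives object 4, generation 0, which you would then look up in the cross-reference table to get its byte offset.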

Then you'll need to find the Contents object referenced here. A content stream, if it's not encrypted or compressed, will show you the drawing commands and text commands being drawn to the page.

4 0 obj
<<
  /Length 15660 
>>
stream
BT F1 10.0 Tf 30.0 750.0 Td (<< ) Tj ET BT F1 10.0 Tf 50.0 738.0 Td (/) 
Tj ET BT F1 10.0 Tf 56.0586 738.0 Td (astring) Tj ET BT F1 10.0 Tf 86.7852 
738.0 Td ( ) Tj ET BT F1 10.0 Tf 89.2852 738.0 Td (\() Tj ET BT F1 10.0 Tf 
92.6133 738.0 Td (this string data) Tj ET 
[...lots more commands follow...]
endstream
endobj

Text commands will always be bracketed by BT ... ET. In here, you can finally see the strings wrapped in parentheses. But you'll have to pay attention to the coordinates (for example, 30.0 750.0 Td) of each string to figure out which ones are part of the same logical line.
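To give a feel for the string part alone, here is a minimal sketch that copies the contents of every parenthesized string in an uncompressed content stream into one output buffer. The name extract_strings is invented for this example. It handles backslash escapes and nested parentheses, but it ignores the Td coordinates entirely, does not handle hex strings written as <...>, and assumes the stream text is NUL-terminated, so treat it as a starting point only.

```c
#include <assert.h>
#include <stddef.h>

/* Copies the contents of every (...) literal string in s into out,
   back to back, and NUL-terminates the result. Returns the number of
   bytes written, not counting the NUL. Output beyond outsz - 1 bytes
   is silently dropped. Hypothetical helper for illustration. */
size_t extract_strings(const char *s, char *out, size_t outsz)
{
    size_t n = 0;
    int depth = 0;                  /* nesting level of unescaped parens */

    assert(s != NULL && out != NULL && outsz > 0);
    for (; *s != '\0'; s++) {
        char c = *s;
        if (depth == 0) {
            if (c == '(')
                depth = 1;          /* start of a literal string */
            continue;               /* operators and numbers are skipped */
        }
        if (c == '\\' && s[1] != '\0') {
            c = *++s;
            if (c == 'n') c = '\n';
            else if (c == 'r') c = '\r';
            else if (c == 't') c = '\t';
            else if (c >= '0' && c <= '7') {
                int v = c - '0';    /* octal escape of up to three digits */
                for (int k = 0; k < 2 && s[1] >= '0' && s[1] <= '7'; k++)
                    v = v * 8 + (*++s - '0');
                c = (char)v;
            }
            /* any other escaped character, such as ( ) or \, is kept as is */
        } else if (c == '(') {
            depth++;                /* balanced inner paren, part of the text */
        } else if (c == ')') {
            if (--depth == 0)
                continue;           /* closing paren of the string itself */
        }
        if (n + 1 < outsz)
            out[n++] = c;
    }
    out[n] = '\0';
    return n;
}
```

Run over the six strings visible in the sample stream above, this produces the text "<< /astring (this string data". Deciding where line breaks and spaces belong still requires looking at the coordinates, as described above.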

If the PDF was created from a word processor, it is likely to contain text in this form, but with lots of caveats. It might have re-encoded fonts, in which case the text strings no longer represent ASCII characters but just positions in the font's encoding vector. If the PDF was created from a scanned document, it may contain only images of the pages, with no text content at all unless it has gone through a conversion involving OCR.
