How to read lines from a PDF file into a C program using Ghostscript?

I am currently taking a course in C programming, and for our final project we need to read some text from a PDF into a string so that we can manipulate the string.

In essence, what I am looking for is something similar to this, only with a .pdf instead of a .txt file.

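  char line[256];                       /* fscanf needs real storage, not an uninitialized pointer */
  FILE *fp = fopen("myfile.txt", "r");  /* and a FILE *, not a file name */
  if (fp != NULL && fscanf(fp, " %255[^\n]", line) == 1) { /* use line here */ }
  if (fp != NULL) fclose(fp);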

I have no experience with Ghostscript, so I have no idea if this is even possible, although we were told that we should use Ghostscript.

The current version of Ghostscript includes the 'txtwrite' device, which will extract text from any supported input (PostScript, PDF, XPS, PCL) and will emit it in a variety of forms.

The UTF-8 output would probably be most useful to you.

Caveat! Many things which appear to be text in PDF files are not text, and no attempt is made to deal with these.
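To connect the txtwrite device to a C program, one option is to run Ghostscript as a child process and read its standard output line by line. The following is only a minimal sketch, assuming a POSIX system (for `popen`) with the `gs` executable on the PATH. The function names `build_gs_command`, `read_text_line` and `print_pdf_lines` are hypothetical helpers made up for this example, not part of Ghostscript.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes popen/pclose in <stdio.h> */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the gs command line for text extraction to stdout.
 * Returns 0 on success, -1 if buf is too small or the path
 * contains a single quote (rejected to keep shell quoting simple). */
int build_gs_command(char *buf, size_t size, const char *pdf_path)
{
    assert(buf != NULL && pdf_path != NULL);
    if (strchr(pdf_path, '\'') != NULL)
        return -1;
    int n = snprintf(buf, size,
        "gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=- '%s'",
        pdf_path);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Read one line from stream into line and strip the line ending.
 * Returns 1 if something was read, 0 at end of input.
 * A line longer than the buffer comes back in several pieces. */
int read_text_line(FILE *stream, char *line, size_t size)
{
    if (fgets(line, (int)size, stream) == NULL)
        return 0;
    line[strcspn(line, "\r\n")] = '\0';
    return 1;
}

/* Run Ghostscript on pdf_path and print each extracted line with a
 * line number. Returns the number of lines, or -1 on failure. */
int print_pdf_lines(const char *pdf_path)
{
    char cmd[1024];
    char line[1024];
    int count = 0;

    if (build_gs_command(cmd, sizeof cmd, pdf_path) != 0)
        return -1;
    FILE *gs = popen(cmd, "r");   /* gs writes the text to our pipe */
    if (gs == NULL)
        return -1;
    while (read_text_line(gs, line, sizeof line))
        printf("%d: %s\n", ++count, line);
    return pclose(gs) == 0 ? count : -1;
}
```

Calling `print_pdf_lines("input.pdf")` from `main` would then print whatever text Ghostscript manages to extract; replacing the `printf` with your own string handling gives you each line as an ordinary C string. The caveat above still applies: anything in the PDF that only looks like text will not show up.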

ps2ascii is deprecated with the release of the txtwrite device, but in any case it is perfectly capable (despite the name) of dealing with PDF as an input.

I can't think why anyone assigned you this project; PDF files are not text files and cannot be treated as such. In addition to the fact that PDF files are generally compressed, identifying the contents stream and all the other streams it relies on (which may themselves include text) is non-trivial. Plus, the text is often encoded in a way which can be difficult to understand (this is particularly true of CIDFonts and TrueType fonts).

Perhaps your tutor expected you to first become expert in the PDF format, but that seems excessive for a C course.

You can convert your PDF to PostScript using pdf2ps, and then to ASCII using ps2ascii. You already know how to read ASCII.

Both utilities mentioned are in the Ghostscript package.
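Once the two conversions have produced a plain text file (for example by running `pdf2ps input.pdf output.ps` followed by `ps2ascii output.ps output.txt`, where the file names are placeholders), the C side is ordinary file I/O. As a rough sketch, the hypothetical helper `read_text_file` below loads the whole file into one heap-allocated string, which matches the goal of manipulating the text as a string.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read an entire text file into a newly allocated, NUL-terminated
 * string. The caller must free() the result.
 * Returns NULL if the file cannot be opened or memory runs out. */
char *read_text_file(const char *path)
{
    assert(path != NULL);
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    size_t cap = 4096;   /* current buffer size */
    size_t len = 0;      /* bytes stored so far */
    char *text = malloc(cap);
    if (text == NULL) {
        fclose(fp);
        return NULL;
    }

    size_t n;
    /* Always leave one byte free for the terminating NUL. */
    while ((n = fread(text + len, 1, cap - len - 1, fp)) > 0) {
        len += n;
        if (cap - len - 1 == 0) {          /* buffer full: double it */
            char *bigger = realloc(text, cap * 2);
            if (bigger == NULL) {
                free(text);
                fclose(fp);
                return NULL;
            }
            text = bigger;
            cap *= 2;
        }
    }
    text[len] = '\0';
    fclose(fp);
    return text;
}
```

A call such as `char *text = read_text_file("output.txt");` would then give you the converted text in a single string; remember to check for `NULL` and to `free(text)` when you are done.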
