Java: Converting a PostScript File into text

Is there a Java library that converts a PostScript file (".ps") into a String or text file (or something I can read with an InputStream)?

I have these files and need to read them and handle them according to the text in them. They always contain only text, usually just a single line such as

date:SWYgeW91IHJlYWQgdGhpcyB5b3UncmUgcHJvYmFibGUgdG8gY3VyaW91cyAgYnV0IG5pY2UgdHJ5IGFueXdheS4gUGxlYXNlIEhlbHA=

Right now I convert it into a PDF and "read" it with an OCR engine, but that seems a little over the top for just one line.

Is there another way to do it?

If you could point me in the right direction, that would be great.

PostScript is a language for describing graphical output on paper, aimed at a printer device. As such it does not really contain plain text, and "extracting" text from it poses problems. The text could, for instance, be determined programmatically in places, or it could be interspersed with PS code, making the raw text data useless.

Normally you would send a modified PS file to a printer (real or virtual) with a specific configuration that causes the result to be output as a standard text sequence (without the graphical formatting).
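One way to get such a "virtual printer" is Ghostscript's txtwrite output device, which emits the text of a document instead of rendering it. The following is only a minimal sketch, assuming Ghostscript is installed and reachable as "gs" on the PATH, that the output is UTF-8, and that Java 9 or later is used; the class and method names (PsTextExtractor, buildCommand, extractText) are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

final class PsTextExtractor {

    // Builds the Ghostscript command line. Assumption: the executable is
    // called "gs" and is on the PATH (on Windows it is often gswin64c).
    static List<String> buildCommand(Path psFile) {
        return Arrays.asList(
                "gs",
                "-q",                 // suppress the startup banner
                "-dNOPAUSE",          // do not pause after each page
                "-dBATCH",            // exit once the input is processed
                "-sDEVICE=txtwrite",  // text-only output device
                "-sOutputFile=-",     // send the result to stdout
                psFile.toString());
    }

    // Runs Ghostscript and returns whatever it printed, trimmed.
    static String extractText(Path psFile) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(psFile))
                .redirectErrorStream(true) // merge stderr so the pipe cannot block
                .start();
        byte[] output = process.getInputStream().readAllBytes(); // Java 9+
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Ghostscript exited with code " + exitCode);
        }
        // Assumption: the text comes back as UTF-8.
        return new String(output, StandardCharsets.UTF_8).trim();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PsTextExtractor <file.ps>");
            return;
        }
        System.out.println(extractText(Paths.get(args[0])));
    }
}
```

For a file that contains a single line of text, the returned String can then be handled directly, without the PDF and OCR detour.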

This is often done by altering the PS code file to change the text output command.
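As a rough illustration of that idea, the sketch below prepends a small PostScript prolog that redefines the show operator so that it prints its string to standard output instead of painting it on the page. This is not the exact procedure from the paper cited below, only a simplified assumption of how such a modification might look: it ignores other text operators (ashow, widthshow and so on) and any font re-encoding, and the names ShowRedirect, PROLOG and redirectShow are hypothetical.

```java
final class ShowRedirect {

    // PostScript prolog: `show` now prints its string operand followed by
    // a newline to stdout rather than drawing it.
    static final String PROLOG = "/show { print (\\n) print } bind def\n";

    // Returns the PostScript source with the prolog inserted. If the file
    // starts with a "%!" header line, the prolog goes right after it so
    // the header stays on the first line.
    static String redirectShow(String postScript) {
        int firstNewline = postScript.indexOf('\n');
        if (postScript.startsWith("%!") && firstNewline >= 0) {
            return postScript.substring(0, firstNewline + 1)
                    + PROLOG
                    + postScript.substring(firstNewline + 1);
        }
        return PROLOG + postScript;
    }

    public static void main(String[] args) {
        String original = "%!PS\n/Courier findfont 12 scalefont setfont\n"
                + "72 720 moveto\n(date:example) show\nshowpage\n";
        System.out.print(redirectShow(original));
    }
}
```

The modified source would then be run through a PostScript interpreter, and the text collected from its standard output.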

A description of this method can be found in part 3 of the following University of Waikato paper:

http://www.cs.waikato.ac.nz/~ihw/papers/98NM-Reed-IHW-Extract-Text.pdf

If you convert the PostScript file to PDF (for example, with Ghostscript's ps2pdf or with Acrobat Distiller), you could then read that file using iText ( http://itextpdf.com ). You could also inspect the PDF in a more readable form using RUPS, one of the iText tools.
