
Ghostscript to convert PDF to text and keep the PDF table format

I have this command that converts a PDF to a text file:

gswin32c -dBATCH -dNOPAUSE -dSAFER -dDELAYBIND -dWRITESYSTEMDICT 
-dSIMPLE -sDEVICE=txtwrite -dTextFormat=2 -dFirstPage=1 -dLastPage=1 
-sOutputFile=C:\out.txt C:\in.pdf

It works almost fine; the only problem is that it does not keep the PDF table formatting.

Example:

In the PDF file:

Type    From        Name             Name2                   Code         Week
Regular 30/03/15    KNOWLES, BEN     HOOT KNOWLES, ANGELA    367-739-746  80.00       
Regular 30/03/15    RICHARDS, COLE   ROBERT HARRIS, BRADIE   401-844-307  108.00      
Regular 30/03/15    SKEELS, MATT     BISHOP, JASON GREGSON   413-980-291  112.00

After converting it to a text file, the columns run together like this:

Type From Name Name2 Code Week
Regular30/03/15KNOWLES, BENHOOT KNOWLES, ANGELA367-739-74680.00       
Regular30/03/15RICHARDS, COLEROBERT HARRIS, BRADIE401-844-307108.00      
Regular30/03/15SKEELS, MATTBISHOP, JASON GREGSON413-980-291112.00

I need it to keep its formatting. Any idea how to keep the formatting?

I am using Ghostscript gswin32c on a Windows 7 machine; the version is 9.16.

Also, I am open to suggestions for other ways to archive it.

Cheers

There isn't a 'table format' in PDF, just a sequence of text and positions. One of the possible output formats for txtwrite attempts to make a Unicode text file, where the spacing is re-created by space characters. Note that this assumes a fixed-pitch font, so it won't work well if you don't use one.
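To make the fixed-pitch point concrete, here is a minimal Python sketch of how spacing can be re-created with space characters. It is not the txtwrite implementation; the function name `layout_fixed_pitch`, the fragment positions, and the character widths are all made up for illustration. It only shows that when the assumed character width does not match the real glyph widths, the fragments can end up with no separating spaces at all.

```python
def layout_fixed_pitch(fragments, char_width):
    """Place (x, text) fragments on one line, assuming every glyph is char_width wide."""
    line = ""
    for x, text in sorted(fragments):
        column = round(x / char_width)
        # Pad with spaces up to the target column. If the line is already
        # past that column, the fragment is appended with no separator.
        line += " " * max(0, column - len(line))
        line += text
    return line


# Hypothetical fragments: x positions in points, not taken from any real PDF.
row = [(0.0, "Regular"), (48.0, "30/03/15"), (120.0, "KNOWLES, BEN")]

print(layout_fixed_pitch(row, char_width=6.0))   # width matches the text
print(layout_fixed_pitch(row, char_width=12.0))  # width too large: columns collide
```

With a width of 6 the columns stay apart; with a width of 12 the three fragments are glued together. Whether something like this is happening with your file is only a guess without the PDF itself.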

Without seeing the input PDF file it's not really possible to make any guesses as to why this isn't producing the output you expect.

You can tackle this problem yourself. Firstly, there are other potential output formats; one of them is an XML-like format which emits the text sequences and positions. You could use that and recreate the format yourself (or even just archive it directly). Alternatively, since Ghostscript is open-source, you could read and debug the source yourself and figure out why your PDF file is causing a problem.
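As a rough sketch of the "recreate the format yourself" route, the Python below rebuilds aligned rows from text fragments and their positions. It assumes you have already pulled `(x, y, text)` tuples out of the XML-like output (the current Ghostscript documentation lists that format as -dTextFormat=0; check this against your version). The parsing step is deliberately left out, and the function name `rebuild_table`, the tolerances, and the sample coordinates are hypothetical. It also assumes y grows down the page and that each column has a consistent left edge.

```python
def rebuild_table(fragments, row_tol=2.0, col_tol=5.0):
    """Rebuild aligned text rows from (x, y, text) fragments."""
    # Cluster the left edges into column start positions.
    col_starts = []
    for x in sorted(f[0] for f in fragments):
        if not col_starts or x - col_starts[-1] > col_tol:
            col_starts.append(x)

    # Group fragments into rows by y, then assign each to the nearest column.
    rows = []  # list of (y, {column_index: cell_text})
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if not rows or abs(y - rows[-1][0]) > row_tol:
            rows.append((y, {}))
        col = min(range(len(col_starts)), key=lambda i: abs(col_starts[i] - x))
        cells = rows[-1][1]
        cells[col] = (cells.get(col, "") + " " + text).strip()

    # Pad every cell to the widest entry in its column.
    ncols = len(col_starts)
    widths = [max(len(cells.get(i, "")) for _, cells in rows) for i in range(ncols)]
    return [
        "  ".join(cells.get(i, "").ljust(widths[i]) for i in range(ncols)).rstrip()
        for _, cells in rows
    ]


# Hypothetical positions; real values would come from the XML-like output.
fragments = [
    (72.0, 100.0, "Type"), (130.0, 100.0, "From"), (200.0, 100.0, "Name"),
    (72.0, 114.0, "Regular"), (130.0, 114.0, "30/03/15"), (200.0, 114.0, "KNOWLES, BEN"),
]
lines = rebuild_table(fragments)
print("\n".join(lines))
```

Because the columns come from the positions rather than from an assumed character width, this does not depend on the font being fixed-pitch. It will still struggle with right-aligned or centred columns, where the left edges vary from row to row.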
