Ghostscript to convert PDF to text and keep PDF file table format
I have this command that converts a PDF to a text file:
gswin32c -dBATCH -dNOPAUSE -dSAFER -dDELAYBIND -dWRITESYSTEMDICT
-dSIMPLE -sDEVICE=txtwrite -dTextFormat=2 -dFirstPage=1 -dLastPage=1
-sOutputFile=C:\out.txt C:\in.pdf
It works almost fine; the only problem is that it does not keep the PDF table formatting.
Example:
In the PDF file:
Type     From      Name            Name2                  Code         Week
Regular  30/03/15  KNOWLES, BEN    HOOT KNOWLES, ANGELA   367-739-746  80.00
Regular  30/03/15  RICHARDS, COLE  ROBERT HARRIS, BRADIE  401-844-307  108.00
Regular  30/03/15  SKEELS, MATT    BISHOP, JASON GREGSON  413-980-291  112.00
After converting it to a text file, the columns run together like this:
Type From Name Name2 Code Week
Regular30/03/15KNOWLES, BENHOOT KNOWLES, ANGELA367-739-74680.00
Regular30/03/15RICHARDS, COLEROBERT HARRIS, BRADIE401-844-307108.00
Regular30/03/15SKEELS, MATTBISHOP, JASON GREGSON413-980-291112.00
I need it to keep its formatting. Any idea how to keep the formatting?
I am using Ghostscript gswin32c, version 9.16, on a Windows 7 machine.
Also, I am open to suggestions for other ways to archive it.
Cheers
There isn't a 'table format' in PDF, just a sequence of text and positions. One of the possible output formats for txtwrite attempts to make a Unicode text file, where the spacing is re-created by space characters. Note that this assumes a fixed-pitch font, so it won't work well if the PDF doesn't use one.
Without seeing the input PDF file, it's not really possible to make any guesses as to why this isn't producing the output you expect.
You can tackle this problem yourself. Firstly, there are other potential output formats; one of them is an XML-like format which emits the text sequences and positions, so you could use that and recreate the format yourself (or even just archive it directly). Alternatively, since Ghostscript is open-source, you could read and debug the source yourself and figure out why your PDF file is causing a problem.
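As a rough illustration of the "recreate the format yourself" route, here is a minimal Python sketch. It assumes the XML-like output (requested with -dTextFormat=0 in place of -dTextFormat=2) contains one record per character of the form <char bbox="x0 y0 x1 y1" c="A"/>; check that against what your Ghostscript version actually writes. The function name rebuild_rows and the threshold values are hypothetical and will need tuning for each document.

```python
import html
import re

# ASSUMPTION: each character appears as <char bbox="x0 y0 x1 y1" c="A"/>.
# Verify this against the real txtwrite output before relying on it.
CHAR_RE = re.compile(r'<char\s+bbox="([^"]+)"\s+c="([^"]*)"\s*/?>')


def rebuild_rows(xml_text, y_tolerance=2.0, word_gap=2.0, column_gap=8.0,
                 delimiter="\t"):
    """Group characters into rows by y position, then split each row
    into cells wherever the horizontal gap is at least column_gap."""
    rows = []  # (y, [(x0, x1, char), ...]) in order of first appearance
    for match in CHAR_RE.finditer(xml_text):
        x0, y0, x1, _y1 = (float(v) for v in match.group(1).split())
        ch = html.unescape(match.group(2))  # undo XML escaping such as &amp;
        for y, chars in rows:
            if abs(y - y0) <= y_tolerance:
                chars.append((x0, x1, ch))
                break
        else:
            # No existing row is close enough vertically: start a new one.
            rows.append((y0, [(x0, x1, ch)]))

    lines = []
    for _y, chars in rows:
        chars.sort(key=lambda item: item[0])  # left to right
        out = []
        prev_x1 = None
        for x0, x1, ch in chars:
            if prev_x1 is not None:
                gap = x0 - prev_x1
                if gap >= column_gap:
                    out.append(delimiter)  # wide gap: new cell
                elif gap >= word_gap and out[-1] != " " and ch != " ":
                    out.append(" ")  # narrow gap: word break
            out.append(ch)
            prev_x1 = x1
        cells = "".join(out).split(delimiter)
        lines.append(delimiter.join(cell.strip() for cell in cells))
    return lines
```

Reading the Ghostscript output file into a string and passing it to rebuild_rows returns one tab-separated string per table row, which can then be written out as a .tsv file. Gap thresholds are used here instead of fixed column positions because proportional fonts make character counts unreliable; a table with very narrow gutters would need a different heuristic.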