
Ghostscript: convert a PDF and write the output to a text file

1. I need to convert a PDF file into a .txt file. My command seems to work, since I get the converted text on the screen, but somehow I am unable to direct the output into a text file.

public static string[] GetArgs(string inputPath, string outputPath)
{ 
    return new[] {
                "-q", "-dNODISPLAY", "-dSAFER",
                "-dDELAYBIND", "-dWRITESYSTEMDICT", "-dSIMPLE",
                "-c", "save", "-f",
                "ps2ascii.ps", inputPath, "-sDEVICE=txtwrite",
                String.Format("-sOutputFile={0}", outputPath),
                "-c", "quit"
    }; 
}

2. Is there a Unicode-specific .ps?

Update: Posting my complete code, maybe the error is somewhere else.

public static string[] GetArgs(string inputPath, string outputPath)
{
    return new[]    
    {   "-o c:/test.txt",    
        "-dSIMPLE",
        "-sFONTPATH=c:/windows/fonts",
        "-dNODISPLAY",
        "-dDELAYBIND",
        "-dWRITESYSTEMDICT",
        "-f",
        "C:/Program Files/gs/gs9.05/lib/ps2ascii.ps",               
        inputPath,
    };
}

[DllImport("gsdll64.dll", EntryPoint = "gsapi_new_instance")]
private static extern int CreateAPIInstance(out IntPtr pinstance, IntPtr caller_handle);

[DllImport("gsdll64.dll", EntryPoint = "gsapi_init_with_args")]
private static extern int InitAPI(IntPtr instance, int argc, string[] argv);

[DllImport("gsdll64.dll", EntryPoint = "gsapi_exit")]
private static extern int ExitAPI(IntPtr instance);

[DllImport("gsdll64.dll", EntryPoint = "gsapi_delete_instance")]
private static extern void DeleteAPIInstance(IntPtr instance);

private static object resourceLock = new object();

private static void Cleanup(IntPtr gsInstancePtr)
{
    ExitAPI(gsInstancePtr);
    DeleteAPIInstance(gsInstancePtr);
}

public static void ConvertPdfToText(string inputPath, string outputPath) 
{ 
    CallAPI(GetArgs(inputPath, outputPath));
}

private static void CallAPI(string[] args)      
{       
    // Get a pointer to an instance of the Ghostscript API and run the API with the current arguments       
    IntPtr gsInstancePtr;   
    lock (resourceLock)     
    {           
        CreateAPIInstance(out gsInstancePtr, IntPtr.Zero);      
        try
        {
            int result = InitAPI(gsInstancePtr, args.Length, args);                    
            if (result < 0)     
            {
                throw new ExternalException("Ghostscript conversion error", result);        
            }       
        }           
        finally     
        {               
            Cleanup(gsInstancePtr);     
        }       
    }   
}

2 questions, 2 answers:

  1. To get output to a file, use -sOutputFile=/path/to/file on the command line, or add the line

     "-sOutputFile=/where/it/should/go", 

    to your C# code (it can be the first argument, but should come before your first "-c"). But first get rid of the other -sOutputFile stuff you already have in there... :-)

  2. No, PostScript isn't aware of Unicode.


Update

(Remark: Extracting text from PDF reliably is, for various technical reasons, notoriously difficult. And it may not work at all, whichever tool you try...)

On the command line, the following two should work for recent releases of Ghostscript (current version is v9.05). It would be your own job...

  • ...to test which command works better for your use case, and
  • ...to translate these into C# code.

1. txtwrite device:

gswin32c.exe ^
   -o c:/path/to/output.txt ^
   -dTextFormat=3 ^
   -sDEVICE=txtwrite ^
    input.pdf

Notes:

  1. You may want to use gswin64c.exe (if available) if your system is 64-bit.
  2. The -o syntax for the output works only with recent versions of Ghostscript.
  3. The -o syntax also implicitly sets the -dBATCH and -dNOPAUSE parameters.
  4. If your Ghostscript is too old and the -o shorthand doesn't work, replace it with -dBATCH -dNOPAUSE -sOutputFile=... .
  5. Ghostscript can handle forward slashes inside path arguments even on Windows.
  6. -dTextFormat is set to 3 by default anyway, so it is not required here. 'Legal' values for it are:
    • 0 : This outputs XML-escaped Unicode along with info related to the format of the text (position, font name, point size, etc). Intended for developers only.
    • 1 : Same as 0 , but will output blocks of text.
    • 2 : This outputs Unicode (UCS2) text with a BOM (Byte Order Mark); tries to approximate the layout of the text in the original document.
    • 3 : (default) Same as 2 , but the text is encoded in UTF-8.
  7. The txtwrite device with this -dTextFormat modifier is a rather new asset of Ghostscript, so please report bugs if you find any.
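
Translated into C#, the txtwrite command might look like the following sketch. The names `TxtWriteArgs` and `Build` are hypothetical, not part of any library. It assumes the array is passed to `gsapi_init_with_args`, which, like a C `main`, ignores element 0 of the array, so a placeholder program name goes first. It uses the long form `-dBATCH -dNOPAUSE -sOutputFile=...` from note 4 instead of `-o`, and every switch is its own array element.

```csharp
using System;

public static class TxtWriteArgs
{
    // Builds an argument array for the txtwrite device.
    // Assumption: the array is handed to gsapi_init_with_args, which treats
    // element 0 as the program name and ignores it, so a placeholder goes first.
    public static string[] Build(string inputPath, string outputPath)
    {
        return new[]
        {
            "gs",                           // placeholder for argv[0]
            "-dBATCH", "-dNOPAUSE",         // long form of what -o implies
            "-sDEVICE=txtwrite",
            "-dTextFormat=3",               // UTF-8 text (also the default)
            "-sOutputFile=" + outputPath,   // one element per switch
            inputPath
        };
    }
}
```

A call such as `TxtWriteArgs.Build("input.pdf", "c:/path/to/output.txt")` would then take the place of the question's `GetArgs`. Note that `-dNODISPLAY` is deliberately absent here, since this method relies on the txtwrite output device.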

2. Using ps2ascii.ps

gswin32c.exe ^
   -sstdout=c:/path/to/output.txt ^
   -dSIMPLE ^
   -sFONTPATH=c:/windows/fonts ^
   -dNODISPLAY ^
   -dDELAYBIND ^
   -dWRITESYSTEMDICT ^
   -f /path/to/ps2ascii.ps ^
    input.pdf

Notes:

  1. This is a completely different method from the txtwrite device one and cannot be mixed with it!
  2. ps2ascii.ps is a file, a PostScript program that Ghostscript invokes to extract the text. It is usually located in the /lib subdirectory of the Ghostscript install directory. Go and see if it is really there.
  3. -dSIMPLE may be replaced by -dCOMPLEX in order to print out extra info lines (current color, presence of an image, rectangular fills).
  4. -sstdout=... is required because the ps2ascii.ps PostScript program prints to stdout only and can't be told to write to a file. So -sstdout=... tells Ghostscript to redirect its stdout to a file.
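
The ps2ascii.ps command could be expressed as a C# argument array along these lines. This is only a sketch: `Ps2AsciiArgs` and `Build` are hypothetical names, the font path is copied from the command, and the leading `"gs"` element assumes the array goes to `gsapi_init_with_args`, which ignores element 0. Whether `-sstdout=...` behaves the same through the DLL as it does with the command-line executable is something to verify on your own setup.

```csharp
using System;

public static class Ps2AsciiArgs
{
    // Builds an argument array for the ps2ascii.ps method.
    // ps2AsciiPath is the full path to ps2ascii.ps in Ghostscript's lib directory.
    public static string[] Build(string ps2AsciiPath, string inputPath, string outputPath)
    {
        return new[]
        {
            "gs",                             // placeholder for argv[0] (ignored by the API)
            "-sstdout=" + outputPath,         // redirect Ghostscript's stdout to a file
            "-dSIMPLE",
            "-sFONTPATH=c:/windows/fonts",
            "-dNODISPLAY",
            "-dDELAYBIND",
            "-dWRITESYSTEMDICT",
            "-f", ps2AsciiPath,               // the PostScript program that extracts the text
            inputPath
        };
    }
}
```

No `-sDEVICE` or `-sOutputFile` appears here, in keeping with note 1: the two methods are not meant to be combined.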

3. Non-Ghostscript methods

Do not ignore other, non-Ghostscript methods that may be easier to work with. All of the following are cross-platform and should be available on Windows too:

  • mudraw -t
    GPL licensed (or commercial, if you need). Command-line utility from MuPDF to extract text from PDF (MuPDF is developed by the same group of developers that do Ghostscript).
  • pdftotext
    GPL licensed. Command-line utility from Poppler (which is a fork of XPDF, which also provides a pdftotext).
  • podofotxtextract
    GPL licensed. Command-line utility based on the PoDoFo PDF processing library.
  • TET
    The Text Extraction Toolkit from PDFlib.com (commercial, but may be gratis for personal use; I didn't check recent news). Probably the most powerful text extraction tool of them all...
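
If one of the command-line tools is good enough, calling it from C# needs no P/Invoke at all. The sketch below runs `pdftotext input output` as a child process. The class and method names are hypothetical, and it assumes a `pdftotext` executable is installed and reachable through PATH.

```csharp
using System;
using System.Diagnostics;

public static class PdfToTextRunner
{
    // Builds the start info for: pdftotext "<input>" "<output>"
    // Assumption: pdftotext is on PATH; otherwise put the full path in FileName.
    public static ProcessStartInfo BuildStartInfo(string inputPath, string outputPath)
    {
        return new ProcessStartInfo
        {
            FileName = "pdftotext",
            Arguments = "\"" + inputPath + "\" \"" + outputPath + "\"",
            UseShellExecute = false,
            CreateNoWindow = true
        };
    }

    // Runs the tool, waits for it to finish, and returns its exit code.
    public static int Run(string inputPath, string outputPath)
    {
        using (Process process = Process.Start(BuildStartInfo(inputPath, outputPath)))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

Both paths are quoted so that directories such as `C:/Program Files/...` survive argument splitting.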
