
C# PDF to BMP for free

I am writing a program that uses OCR (tessnet2) to scan an image file and extract certain information. This was easy until I found out that I would be scanning PDF attachments from an Exchange server.

The first problem I am working on is how to convert my PDFs to BMP files. From what I can tell so far, TessNet2 can only read image files, specifically BMP. So I am now tasked with converting a PDF of indeterminate length (2 to 15 pages) to BMP images. Once that is done, I can easily scan each image using the code I have already built with TessNet2.

I have seen Ghostscript used for this task. I'm just wondering whether there is another free solution, or whether one of you fine humans could give me a crash course on how to do this using Ghostscript.

Found a CodeProject article on converting PDFs to images:

http://www.codeproject.com/Articles/57100/Simple-and-Free-PDF-to-Image-Conversion
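For the Ghostscript route the question asks about, here is a minimal sketch that shells out to the Ghostscript command-line executable from C#. It is an illustration, not the CodeProject article's code: the class and method names are made up, and the path to the executable (for example `gswin64c.exe` on Windows) is an assumption you would have to adjust. The `bmp16m` device writes 24-bit BMP files, and `%03d` in the output name is expanded by Ghostscript to the page number, so each page gets its own file.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper; names are illustrative only.
public static class GhostscriptSketch
{
    // Builds the argument string for the Ghostscript console executable.
    public static string BuildArguments(string pdfPath, string outputDir, int dpi)
    {
        // Ghostscript replaces %03d with the page number (001, 002, ...).
        string outputPattern = Path.Combine(outputDir, "page-%03d.bmp");
        return "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=bmp16m -r" + dpi +
               " -sOutputFile=\"" + outputPattern + "\" \"" + pdfPath + "\"";
    }

    // gsExe is the assumed path to the Ghostscript executable on this machine.
    public static void ConvertPdfToBmp(string gsExe, string pdfPath, string outputDir, int dpi = 300)
    {
        Directory.CreateDirectory(outputDir);

        var startInfo = new ProcessStartInfo
        {
            FileName = gsExe,
            Arguments = BuildArguments(pdfPath, outputDir, dpi),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "Ghostscript exited with code " + process.ExitCode);
        }
    }
}
```

A call might look like `GhostscriptSketch.ConvertPdfToBmp(@"C:\path\to\gswin64c.exe", "input.pdf", "out")`. A higher `dpi` generally helps OCR accuracy at the cost of larger BMP files.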

You can use ImageMagick too. And it's totally free! No trial or payment.

Just download the ImageMagick .exe and install it, then add the NuGet package to your project. Note that ImageMagick relies on Ghostscript to read PDF files, so Ghostscript needs to be installed as well.

Here is the code. Hope it helps (even though the question was asked 6 years ago...)

Procedure:

    using System.IO;
    using ImageMagick;

    public void PDFToBMP(string output)
    {
        MagickReadSettings settings = new MagickReadSettings();
        // Setting the density to 500 dpi creates an image with better quality
        settings.Density = new Density(500);

        string[] files = GetFiles();
        foreach (string file in files)
        {
            string nameWithoutExtension = Path.GetFileNameWithoutExtension(file);
            string path = Path.Combine(output, nameWithoutExtension);
            using (MagickImageCollection images = new MagickImageCollection())
            {
                // The read settings only take effect when they are passed to Read
                images.Read(file, settings);
                int page = 1;
                foreach (var image in images)
                {
                    image.Format = MagickFormat.Bmp; // for another image format, change the format here...
                    // Write one file per page so later pages do not overwrite earlier ones
                    image.Write(path + "_" + page + ".bmp"); // ...and the extension here
                    page++;
                }
            }
        }
    }

Function GetFiles():

    // Requires: using System.Collections.Generic; using System.IO;
    public string[] GetFiles()
    {
        // CreateDirectory does nothing if the directory already exists
        Directory.CreateDirectory(@"your\path");

        DirectoryInfo dirInfo = new DirectoryInfo(@"your\path");
        FileInfo[] fileInfos = dirInfo.GetFiles();
        List<string> list = new List<string>();
        foreach (FileInfo info in fileInfos)
        {
            // HACK: Just skip the protected samples file...
            if (info.Name.IndexOf("protected") == -1)
                list.Add(info.FullName);
        }
        return list.ToArray();
    }

I recognize this is a very old question, but it is an ongoing problem. If you are targeting .NET 6 or later, I hope you will take a look at my library Melville.PDF.

Melville.Pdf is an MIT-licensed C# implementation of a PDF renderer. I hope it serves a need that I have felt for some time.

If you are trying to get text out of PDF documents, render + OCR may be the hard way around. Some PDF files are just a thin wrapper around image objects, but many actually have text inside of them. Melville.PDF does not do text extraction (yet), but extracting the embedded text directly might be an easier way to get text out of some files.
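To illustrate the point about embedded text, here is a rough sketch that checks which pages of a PDF already carry a text layer and which would still need render + OCR. It does not use Melville.PDF (which, as noted, does not extract text yet); it assumes the separate open-source PdfPig library (`UglyToad.PdfPig` on NuGet), and the class and method names are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using UglyToad.PdfPig;          // assumption: PdfPig NuGet package is installed
using UglyToad.PdfPig.Content;

// Hypothetical helper; names are illustrative only.
public static class TextLayerCheck
{
    // A page is treated as needing OCR when its embedded text is empty or whitespace.
    public static bool NeedsOcr(string pageText)
    {
        return string.IsNullOrWhiteSpace(pageText);
    }

    // Returns the 1-based numbers of the pages that have no usable text layer.
    public static List<int> FindPagesNeedingOcr(string pdfPath)
    {
        var pagesNeedingOcr = new List<int>();
        using (PdfDocument document = PdfDocument.Open(pdfPath))
        {
            foreach (Page page in document.GetPages())
            {
                if (NeedsOcr(page.Text))
                    pagesNeedingOcr.Add(page.Number);
            }
        }
        return pagesNeedingOcr;
    }
}
```

Only the pages returned by `FindPagesNeedingOcr` would then have to be rendered to BMP and passed to the OCR step. Scanned attachments will often land in that list, so this is a shortcut for some files rather than a replacement for OCR.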
