Convert Word doc and docx format to PDF in .NET Core without Microsoft.Office.Interop

I need to display Word .doc and .docx files in a browser. There's no real client-side way to do this, and for legal reasons these documents can't be shared with Google Docs or Microsoft Office 365.

Browsers can't display Word, but they can display PDF, so I want to convert these documents to PDF on the server and then display that.

I know this can be done using Microsoft.Office.Interop.Word, but my application is .NET Core and does not have access to Office interop. It could be running on Azure, but it could also be running in a Docker container on anything else.

There appear to be lots of similar questions, but most are asking about full-framework .NET or assume that the server runs Windows, so their answers are no use to me.

How do I convert .doc and .docx files to .pdf without access to Microsoft.Office.Interop.Word?

This was such a pain; no wonder all the third-party solutions charge $500 per developer.

The good news is that the Open XML SDK recently added support for .NET Standard, so it looks like you're in luck with the .docx format.

The bad news is that at the moment there isn't a lot of choice for PDF generation libraries on .NET Core. Since it doesn't look like you want to pay for one, and you can't legally use a third-party service, we have little choice except to roll our own.

The main problem is getting the Word document content transformed to PDF. One popular way is to convert the Docx to HTML and export that to PDF. It was hard to find, but there is a .NET Core version of the OpenXMLSDK-PowerTools that supports transforming Docx to HTML. The pull request is "about to be accepted"; you can get it from here:

https://github.com/OfficeDev/Open-Xml-PowerTools/tree/abfbaac510d0d60e2f492503c60ef897247716cf

Now that we can extract the document content as HTML, we need to convert it to PDF. There are a few libraries that convert HTML to PDF; for example, DinkToPdf is a cross-platform wrapper around the WebKit-based HTML-to-PDF library libwkhtmltox.

I thought DinkToPdf was better than https://code.msdn.microsoft.com/How-to-export-HTML-to-PDF-c5afd0ce


Docx to HTML

Let's put this all together. Download the OpenXMLSDK-PowerTools .NET Core project and build it (just OpenXMLPowerTools.Core and OpenXMLPowerTools.Core.Example; ignore the other project).

Set OpenXMLPowerTools.Core.Example as the StartUp project. Add a Word document to the project (e.g. test.docx) and set the file's property Copy To Output = If Newer.

Run the console project:

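// Needs: using System; using System.IO; using System.IO.Packaging; using System.Xml.Linq;
// plus DocumentFormat.OpenXml.Packaging and the OpenXmlPowerTools namespace.
static void Main(string[] args)
{
    var source = Package.Open(@"test.docx");
    var document = WordprocessingDocument.Open(source);
    HtmlConverterSettings settings = new HtmlConverterSettings();
    XElement html = HtmlConverter.ConvertToHtml(document, settings);

    Console.WriteLine(html.ToString());
    // Save the converted HTML next to the executable.
    var writer = File.CreateText("test.html");
    writer.WriteLine(html.ToString());
    writer.Dispose();
    Console.ReadLine();
}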

Make sure test.docx is a valid Word document with some text, otherwise you might get an error:

the specified package is invalid. the main part is missing

If you run the project you will see that the HTML looks almost exactly like the content of the Word document.

However, if you try a Word document with pictures or links you will notice they're missing or broken.

This CodeProject article addresses these issues: https://www.codeproject.com/Articles/1162184/Csharp-Docx-to-HTML-to-Docx

I had to change the static Uri FixUri(string brokenUri) method to return a Uri, and I added user-friendly error messages.

static void Main(string[] args)
{
    var fileInfo = new FileInfo(@"c:\temp\MyDocWithImages.docx");
    string fullFilePath = fileInfo.FullName;
    string htmlText = string.Empty;
    try
    {
        htmlText = ParseDOCX(fileInfo);
    }
    catch (OpenXmlPackageException e)
    {
        if (e.ToString().Contains("Invalid Hyperlink"))
        {
            using (FileStream fs = new FileStream(fullFilePath,FileMode.OpenOrCreate, FileAccess.ReadWrite))
            {
                UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
            }
            htmlText = ParseDOCX(fileInfo);
        }
    }

    var writer = File.CreateText("test1.html");
    writer.WriteLine(htmlText.ToString());
    writer.Dispose();
}
        
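public static Uri FixUri(string brokenUri)
{
    // Strip a leading "mailto:" and keep what is left only if it parses as an absolute URI.
    string candidate = brokenUri.StartsWith("mailto:", StringComparison.Ordinal)
        ? brokenUri.Substring("mailto:".Length)
        : string.Empty;

    // new Uri(" ") or new Uri("user@example.com") throws a UriFormatException,
    // so fall back to a placeholder link for anything that does not parse.
    return Uri.TryCreate(candidate, UriKind.Absolute, out Uri fixedUri)
        ? fixedUri
        : new Uri("http://broken-link/");
}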

public static string ParseDOCX(FileInfo fileInfo)
{
    try
    {
        byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(byteArray, 0, byteArray.Length);
            using (WordprocessingDocument wDoc =
                                        WordprocessingDocument.Open(memoryStream, true))
            {
                int imageCounter = 0;
                var pageTitle = fileInfo.FullName;
                var part = wDoc.CoreFilePropertiesPart;
                if (part != null)
                    pageTitle = (string)part.GetXDocument()
                                            .Descendants(DC.title)
                                            .FirstOrDefault() ?? fileInfo.FullName;

                WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                {
                    AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                    PageTitle = pageTitle,
                    FabricateCssClasses = true,
                    CssClassPrefix = "pt-",
                    RestrictToSupportedLanguages = false,
                    RestrictToSupportedNumberingFormats = false,
                    ImageHandler = imageInfo =>
                    {
                        ++imageCounter;
                        string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                        ImageFormat imageFormat = null;
                        if (extension == "png") imageFormat = ImageFormat.Png;
                        else if (extension == "gif") imageFormat = ImageFormat.Gif;
                        else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                        else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                        else if (extension == "tiff")
                        {
                            extension = "gif";
                            imageFormat = ImageFormat.Gif;
                        }
                        else if (extension == "x-wmf")
                        {
                            extension = "wmf";
                            imageFormat = ImageFormat.Wmf;
                        }

                        if (imageFormat == null) return null;

                        string base64 = null;
                        try
                        {
                            using (MemoryStream ms = new MemoryStream())
                            {
                                imageInfo.Bitmap.Save(ms, imageFormat);
                                var ba = ms.ToArray();
                                base64 = System.Convert.ToBase64String(ba);
                            }
                        }
                        catch (System.Runtime.InteropServices.ExternalException)
                        { return null; }

                        ImageFormat format = imageInfo.Bitmap.RawFormat;
                        ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders()
                                                    .First(c => c.FormatID == format.Guid);
                        string mimeType = codec.MimeType;

                        string imageSource =
                                string.Format("data:{0};base64,{1}", mimeType, base64);

                        XElement img = new XElement(Xhtml.img,
                                new XAttribute(NoNamespace.src, imageSource),
                                imageInfo.ImgStyleAttribute,
                                imageInfo.AltText != null ?
                                    new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                        return img;
                    }
                };

                XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);
                var html = new XDocument(new XDocumentType("html", null, null, null),
                                                                            htmlElement);
                var htmlString = html.ToString(SaveOptions.DisableFormatting);
                return htmlString;
            }
        }
    }
    catch
    {
        return "The file is either open, please close it or contains corrupt data";
    }
}

You may need the System.Drawing.Common NuGet package to use ImageFormat.

Now we can get images.

If you only want to show Word .docx files in a web browser, it's better not to convert the HTML to PDF, as that will significantly increase bandwidth. You could store the HTML in a file system, in the cloud, or in a database using a VPP technology.


HTML to PDF

The next thing we need to do is pass the HTML to DinkToPdf. Download the DinkToPdf (90 MB) solution. Build the solution; it will take a while for all the packages to be restored and for the solution to compile.

IMPORTANT:

The DinkToPdf library requires the libwkhtmltox.so and libwkhtmltox.dll files in the root of your project if you want to run on Linux and Windows. There's also a libwkhtmltox.dylib file for Mac if you need it.

These libraries are in the v0.12.4 folder. Depending on whether your PC is 32-bit or 64-bit, copy the 3 files to the DinkToPdf-master\DinkToPfd.TestConsoleApp\bin\Debug\netcoreapp1.1 folder.

IMPORTANT 2:

Make sure that you have libgdiplus installed in your Docker image or on your Linux machine. The libwkhtmltox.so library depends on it.

Set DinkToPfd.TestConsoleApp as the StartUp project and change the Program.cs file to read the htmlContent from the HTML file saved with Open-Xml-PowerTools instead of the Lorem Ipsum text.

var doc = new HtmlToPdfDocument()
{
    GlobalSettings = {
        ColorMode = ColorMode.Color,
        Orientation = Orientation.Landscape,
        PaperSize = PaperKind.A4,
    },
    Objects = {
        new ObjectSettings() {
            PagesCount = true,
            HtmlContent = File.ReadAllText(@"C:\TFS\Sandbox\Open-Xml-PowerTools-abfbaac510d0d60e2f492503c60ef897247716cf\ToolsTest\test1.html"),
            WebSettings = { DefaultEncoding = "utf-8" },
            HeaderSettings = { FontSize = 9, Right = "Page [page] of [toPage]", Line = true },
            FooterSettings = { FontSize = 9, Right = "Page [page] of [toPage]" }
        }
    }
};

The PDF compares quite impressively with the Docx, and I doubt many people would pick out many differences (especially if they never see the original).

P.S. I realise you wanted to convert both .doc and .docx to PDF. I'd suggest making a service yourself to convert .doc to .docx using a specific non-server Windows/Microsoft technology. The .doc format is binary and is not intended for server-side automation of Office.
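If both formats arrive through the same upload path, a service like that first has to tell them apart. Below is a minimal sketch that inspects the leading bytes instead of trusting the file extension. The names WordFileKind and WordFileSniffer are made up for this illustration. The signatures only identify the container (OLE compound file versus ZIP), so a match means "probably .doc" or "probably .docx", not a guarantee.

```csharp
using System;
using System.IO;

// Hypothetical names, introduced only for this sketch.
public enum WordFileKind { Unknown, LegacyDoc, OpenXmlDocx }

public static class WordFileSniffer
{
    // Header of an OLE2 compound file, the container used by binary .doc files.
    private static readonly byte[] OleSignature = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    // ZIP local file header; a .docx file is a ZIP package.
    private static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    public static WordFileKind Detect(Stream stream)
    {
        var header = new byte[8];
        int read = 0;
        // Stream.Read may return fewer bytes than requested, so loop until full or end of stream.
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }

        if (StartsWith(header, read, OleSignature)) return WordFileKind.LegacyDoc;
        if (StartsWith(header, read, ZipSignature)) return WordFileKind.OpenXmlDocx;
        return WordFileKind.Unknown;
    }

    private static bool StartsWith(byte[] buffer, int length, byte[] signature)
    {
        if (length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (buffer[i] != signature[i]) return false;
        }
        return true;
    }
}
```

Calling WordFileSniffer.Detect(File.OpenRead(path)) would then decide whether the file goes to the .doc conversion service first or straight into the Docx-to-HTML step.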


With an EXE and command line:

You can also convert purely with wkhtmltopdf.exe, available here: https://wkhtmltopdf.org/libwkhtmltox/
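A rough sketch of driving the executable from .NET Core, assuming wkhtmltopdf is called in its basic form "wkhtmltopdf input.html output.pdf" and that ProcessStartInfo.ArgumentList is available (.NET Core 2.1 or later). The class name WkHtmlToPdfCli and its methods are invented for this example.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, named only for this sketch.
public static class WkHtmlToPdfCli
{
    public static ProcessStartInfo BuildStartInfo(string exePath, string htmlPath, string pdfPath)
    {
        var startInfo = new ProcessStartInfo(exePath)
        {
            UseShellExecute = false,
            RedirectStandardError = true,
            CreateNoWindow = true
        };
        // Each ArgumentList entry is passed as one argument, so paths with spaces need no manual quoting.
        startInfo.ArgumentList.Add(htmlPath);
        startInfo.ArgumentList.Add(pdfPath);
        return startInfo;
    }

    public static void Convert(string exePath, string htmlPath, string pdfPath)
    {
        using var process = Process.Start(BuildStartInfo(exePath, htmlPath, pdfPath));
        if (process == null)
            throw new InvalidOperationException("wkhtmltopdf could not be started.");

        // Drain stderr before waiting so a full pipe buffer cannot block the child process.
        string errors = process.StandardError.ReadToEnd();
        process.WaitForExit();

        if (process.ExitCode != 0)
            throw new InvalidOperationException(
                $"wkhtmltopdf exited with code {process.ExitCode}: {errors}");
    }
}
```

Something like WkHtmlToPdfCli.Convert("wkhtmltopdf", "test1.html", "test1.pdf") would then turn the HTML produced earlier into a PDF without loading the native library into your own process.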

Using the LibreOffice binary

The LibreOffice project is an open-source, cross-platform alternative to MS Office. We can use its capabilities to export doc and docx files to PDF. Currently, LibreOffice has no official API for .NET, so we will talk directly to the soffice binary.

It is a somewhat "hacky" solution, but I think it is the one with the fewest bugs and the lowest maintenance cost possible. Another advantage of this method is that you are not restricted to converting from doc and docx: you can convert from every format LibreOffice supports (e.g. odt, html, spreadsheets, and more).

The implementation

I wrote a simple C# program that uses the soffice binary. This is just a proof of concept (and my first program in C#). It supports Windows out of the box, and Linux only if the LibreOffice package has been installed.

This is main.cs:

using System;
using System.Collections.Generic;
using System.Text;
using System.Diagnostics;
using System.Reflection;

namespace DocToPdf
{
    public class LibreOfficeFailedException : Exception
    {
        public LibreOfficeFailedException(int exitCode)
            // "{0}" is the .NET format placeholder; "{}" would throw a FormatException.
            : base(string.Format("LibreOffice has failed with {0}", exitCode))
            {}
    }

    class Program
    {
        static string getLibreOfficePath() {
            switch (Environment.OSVersion.Platform) {
                case PlatformID.Unix:
                    return "/usr/bin/soffice";
                case PlatformID.Win32NT:
                    string binaryDirectory = System.IO.Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
                    return binaryDirectory + "\\Windows\\program\\soffice.exe";
                default:
                    throw new PlatformNotSupportedException ("Your OS is not supported");
            }
        }

        static void Main(string[] args) {
            string libreOfficePath = getLibreOfficePath();

            // Quote the file name so a path with spaces reaches soffice as one argument.
            // A file name that itself contains double quotes is still not handled.
            ProcessStartInfo procStartInfo = new ProcessStartInfo(libreOfficePath, string.Format("--convert-to pdf --nologo \"{0}\"", args[0]));
            procStartInfo.RedirectStandardOutput = true;
            procStartInfo.UseShellExecute = false;
            procStartInfo.CreateNoWindow = true;
            procStartInfo.WorkingDirectory = Environment.CurrentDirectory;

            Process process = new Process() { StartInfo =      procStartInfo, };
            process.Start();
            process.WaitForExit();

            // Check for failed exit code.
            if (process.ExitCode != 0) {
                throw new LibreOfficeFailedException(process.ExitCode);
            }
        }
    }
}
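File names with spaces or quotes are awkward to pass inside a single formatted argument string. One way around that, sketched below, is ProcessStartInfo.ArgumentList, assuming .NET Core 2.1 or later (I have not checked it under mono). The sketch also adds --headless and --outdir, which are standard soffice switches, so the PDF lands in a known directory. The class SofficeArguments and its methods are hypothetical names for this illustration.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper, named only for this sketch.
public static class SofficeArguments
{
    public static ProcessStartInfo Build(string sofficePath, string inputPath, string outputDirectory)
    {
        var startInfo = new ProcessStartInfo(sofficePath)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        // One entry per argument; the runtime takes care of quoting and escaping.
        startInfo.ArgumentList.Add("--headless");
        startInfo.ArgumentList.Add("--convert-to");
        startInfo.ArgumentList.Add("pdf");
        startInfo.ArgumentList.Add("--outdir");
        startInfo.ArgumentList.Add(outputDirectory);
        startInfo.ArgumentList.Add(inputPath);
        return startInfo;
    }

    // soffice names the result after the input file, with the new extension.
    public static string ExpectedOutputPath(string inputPath, string outputDirectory)
    {
        return Path.Combine(outputDirectory, Path.GetFileNameWithoutExtension(inputPath) + ".pdf");
    }
}
```

The returned ProcessStartInfo can be handed to the same Process code as above, and ExpectedOutputPath tells the caller where to pick up the converted file once the process has exited with code 0.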

Resources

Results

I tested it on Arch Linux, compiled with mono. I ran it using mono with the Linux binary, and using wine with the Windows binary.

You can find the results in the Tests directory:

Input files: testdoc.doc, testdocx.docx

Outputs:

I've recently done this with FreeSpire.Doc. The free version has a limit of 3 pages, but it can easily convert a docx file into PDF using something like this:

private void ConvertToPdf()
{
    try
    {
        for (int i = 0; i < listOfDocx.Count; i++)
        {
            CurrentModalText = "Converting To PDF";
            CurrentLoadingNum += 1;

            string savePath = PdfTempStorage + i + ".pdf";
            listOfPDF.Add(savePath);

            Spire.Doc.Document document = new Spire.Doc.Document(listOfDocx[i], FileFormat.Auto);
            document.SaveToFile(savePath, FileFormat.PDF);
        }
    }
    catch (Exception)
    {
        // Rethrow with "throw;" so the original stack trace is preserved.
        throw;
    }
}

I then stitch these individual PDFs together later using iTextSharp.pdf:

public static byte[] concatAndAddContent(List<byte[]> pdfByteContent, List<MailComm> localList)
{
    using (var ms = new MemoryStream())
    {
        using (var doc = new Document())
        {
            using (var copy = new PdfSmartCopy(doc, ms))
            {
                doc.Open();
                // add checklist at the start
                using (var db = new StudyContext())
                {
                    var contentId = localList[0].ContentID;
                    var temp = db.MailContentTypes.Where(x => x.ContentId == contentId).ToList();
                    if (!temp[0].Code.Equals("LAB"))
                    {
                        pdfByteContent.Insert(0, CheckListCreation.createCheckBox(localList));
                    }
                }

                // Loop through each byte array
                foreach (var p in pdfByteContent)
                {
                    // Create a PdfReader bound to that byte array
                    using (var reader = new PdfReader(p))
                    {
                        // Add the entire document instead of page-by-page
                        copy.AddDocument(reader);
                    }
                }

                doc.Close();
            }
        }

        // Return just before disposing
        return ms.ToArray();
    }
}

I don't know if this suits your use case, as you haven't specified the size of the documents you're trying to convert, but if they're under 3 pages, or you can manipulate them to be under 3 pages, it will allow you to convert them into PDFs.

As mentioned in the comments below, it is also unable to help with RTL languages; thank you @Aria for pointing that out.

Sorry, I don't have enough reputation to comment, but I would like to add my two cents to Jeremy Thompson's answer, and I hope this helps someone.

When I was going through Jeremy Thompson's answer, after downloading OpenXMLSDK-PowerTools and running OpenXMLPowerTools.Core.Example, I got an error like

the specified package is invalid. the main part is missing

at the line

var document = WordprocessingDocument.Open(source);

After struggling for some hours, I found that the test.docx copied to the bin folder was only 1 KB. To solve this, right-click test.docx > Properties and set Copy to Output Directory to Copy always.

I hope this helps some novices like me :)
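A quick way to catch this kind of broken copy before the Open XML SDK throws is to check that the file really is a ZIP package containing the main document part. This is only a heuristic sketch: it assumes the main part sits at word/document.xml, which is where Word writes it, although the package format locates it through relationships. DocxSanityCheck is a made-up name.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, named only for this sketch.
public static class DocxSanityCheck
{
    public static bool LooksLikeDocx(Stream stream)
    {
        try
        {
            // leaveOpen keeps the caller's stream usable after the check.
            using var archive = new ZipArchive(stream, ZipArchiveMode.Read, leaveOpen: true);
            // Heuristic: Word stores the main document part under this entry name.
            return archive.GetEntry("word/document.xml") != null;
        }
        catch (InvalidDataException)
        {
            // The bytes are not a ZIP archive at all.
            return false;
        }
    }
}
```

Running it on the copy in the output folder tells you immediately whether you are looking at a placeholder file rather than a problem in the conversion code.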

To convert DOCX to PDF, even with placeholders, I have created a free "Report-From-DocX-HTML-To-PDF-Converter" library for .NET Core under the MIT license, because I was so annoyed that no simple solution existed and all the commercial solutions were super expensive. You can find it here with an extensive description and an example project:

https://github.com/smartinmedia/Net-Core-DocX-HTML-To-PDF-Converter

You only need the free LibreOffice. I recommend using the LibreOffice portable edition, so that it does not change anything in your server settings. Check where the file "soffice.exe" is located (on Linux it is named differently), because you need its path to fill the variable "locationOfLibreOfficeSoffice".

Here is how to convert from DOCX to PDF and to HTML:

string locationOfLibreOfficeSoffice =   @"C:\PortableApps\LibreOfficePortable\App\libreoffice\program\soffice.exe";

var docxLocation = "MyWordDocument.docx";

var test = new ReportGenerator(locationOfLibreOfficeSoffice);

//Convert from DOCX to PDF
test.Convert(docxLocation, Path.Combine(Path.GetDirectoryName(docxLocation), "Test-Template-out.pdf"));


//Convert from DOCX to HTML
test.Convert(docxLocation, Path.Combine(Path.GetDirectoryName(docxLocation), "Test-Template-out.html"));

As you can see, you can also convert from DOCX to HTML. You can also put placeholders into the Word document, which you can then "fill" with values. That is outside the scope of your question, but you can read about it on GitHub (README).

This adds to Jeremy Thompson's very helpful answer. In addition to the Word document body, I wanted the header (and footer) of the Word document converted to HTML. I didn't want to modify Open-Xml-PowerTools, so I modified Main() and ParseDOCX() from Jeremy's example and added two new functions. ParseDOCX now accepts a byte array, so the original Word Docx isn't modified.

static void Main(string[] args)
{
    var fileInfo = new FileInfo(@"c:\temp\MyDocWithImages.docx");
    byte[] fileBytes = File.ReadAllBytes(fileInfo.FullName);
    string htmlText = string.Empty;
    string htmlHeader = string.Empty;
    try
    {
        htmlText = ParseDOCX(fileBytes, fileInfo.Name, false);
        htmlHeader = ParseDOCX(fileBytes, fileInfo.Name, true);
    }
    catch (OpenXmlPackageException e)
    {
        if (e.ToString().Contains("Invalid Hyperlink"))
        {
            // Repair the broken hyperlinks in the file on disk, then reload its bytes.
            using (FileStream fs = new FileStream(fileInfo.FullName, FileMode.OpenOrCreate, FileAccess.ReadWrite))
            {
                UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
            }
            fileBytes = File.ReadAllBytes(fileInfo.FullName);
            htmlText = ParseDOCX(fileBytes, fileInfo.Name, false);
            htmlHeader = ParseDOCX(fileBytes, fileInfo.Name, true);
        }
    }

    var writer = File.CreateText("test1.html");
    writer.WriteLine(htmlText.ToString());
    writer.Dispose();
    var writer2 = File.CreateText("header1.html");
    writer2.WriteLine(htmlHeader.ToString());
    writer2.Dispose();
}

private static string ParseDOCX(byte[] fileBytes, string filename, bool headerOnly)
{
    try
    {
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(fileBytes, 0, fileBytes.Length);
            using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true))
            {
                int imageCounter = 0;
                var pageTitle = filename;
                var part = wDoc.CoreFilePropertiesPart;
                if (part != null)
                {
                    pageTitle = (string)part.GetXDocument()
                                            .Descendants(DC.title)
                                            .FirstOrDefault() ?? filename;
                }

                WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                {
                    AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                    PageTitle = pageTitle,
                    FabricateCssClasses = true,
                    CssClassPrefix = "pt-",
                    RestrictToSupportedLanguages = false,
                    RestrictToSupportedNumberingFormats = false,
                    ImageHandler = imageInfo =>
                    {
                        ++imageCounter;
                        string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                        ImageFormat imageFormat = null;
                        if (extension == "png") imageFormat = ImageFormat.Png;
                        else if (extension == "gif") imageFormat = ImageFormat.Gif;
                        else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                        else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                        else if (extension == "tiff")
                        {
                            extension = "gif";
                            imageFormat = ImageFormat.Gif;
                        }
                        else if (extension == "x-wmf")
                        {
                            extension = "wmf";
                            imageFormat = ImageFormat.Wmf;
                        }

                        if (imageFormat == null) return null;

                        string base64 = null;
                        try
                        {
                            using (MemoryStream ms = new MemoryStream())
                            {
                                imageInfo.Bitmap.Save(ms, imageFormat);
                                var ba = ms.ToArray();
                                base64 = System.Convert.ToBase64String(ba);
                            }
                        }
                        catch (System.Runtime.InteropServices.ExternalException)
                        { return null; }

                        ImageFormat format = imageInfo.Bitmap.RawFormat;
                        ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders()
                                                    .First(c => c.FormatID == format.Guid);
                        string mimeType = codec.MimeType;

                        string imageSource =
                                string.Format("data:{0};base64,{1}", mimeType, base64);

                        XElement img = new XElement(Xhtml.img,
                                new XAttribute(NoNamespace.src, imageSource),
                                imageInfo.ImgStyleAttribute,
                                imageInfo.AltText != null ?
                                    new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                        return img;
                    }
                };

                // Put header into document body, and remove everything else
                if (headerOnly)
                {
                    MoveHeaderToDocumentBody(wDoc);
                }

                XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);
                var html = new XDocument(new XDocumentType("html", null, null, null),
                                                                            htmlElement);
                var htmlString = html.ToString(SaveOptions.DisableFormatting);
                return htmlString;
            }
        }
    }
    catch
    {
        return "The file is either open, please close it or contains corrupt data";
    }
}

private static void MoveHeaderToDocumentBody(WordprocessingDocument wDoc)
{
    MainDocumentPart mainDocument = wDoc.MainDocumentPart;
    XElement docRoot = mainDocument.GetXDocument().Root;
    XElement body = docRoot.Descendants(W.body).First();
    // Only handles first header. Header info: https://docs.microsoft.com/en-us/office/open-xml/how-to-replace-the-header-in-a-word-processing-document
    HeaderPart header = mainDocument.HeaderParts.FirstOrDefault();
    XElement headerRoot = header.GetXDocument().Root;

    AddXElementToBody(headerRoot, body);

    // document body will have new headers when we return from this function
    return;
}

private static void AddXElementToBody(XElement sourceElement, XElement body)
{
    // Clone the children nodes
    List<XElement> children = sourceElement.Elements().ToList();
    List<XElement> childClones = children.Select(el => new XElement(el)).ToList();

    // Clone the section properties nodes
    List<XElement> sections = body.Descendants(W.sectPr).ToList();
    List<XElement> sectionsClones = sections.Select(el => new XElement(el)).ToList();

    // clear body
    body.Descendants().Remove();

    // add source elements to body
    foreach (var child in childClones)
    {
        body.Add(child);
    }

    // add section properties to body
    foreach (var section in sectionsClones)
    {
        body.Add(section);
    }

    // get text from alternate content if needed - either choice or fallback node
    XElement alternate = body.Descendants(MC.AlternateContent).FirstOrDefault();
    if (alternate != null)
    {
        var choice = alternate.Descendants(MC.Choice).FirstOrDefault();
        var fallback = alternate.Descendants(MC.Fallback).FirstOrDefault();
        if (choice != null)
        {
            var choiceChildren = choice.Elements();
            foreach(var choiceChild in choiceChildren)
            {
                body.Add(choiceChild);
            }
        }
        else if (fallback != null)
        {
            var fallbackChildren = fallback.Elements();
            foreach (var fallbackChild in fallbackChildren)
            {
                body.Add(fallbackChild);
            }
        }
    }
}

You could add similar methods to handle the Word document footer.您可以添加类似的方法来处理 Word 文档页脚。

In my case, I then convert the HTML files to images (using Net-Core-Html-To-Image , also based on wkHtmlToX).就我而言,然后我将 HTML 文件转换为图像(使用Net-Core-Html-To-Image ,也基于 wkHtmlToX)。 I combine the header and body images together (using Magick.NET-Q16-AnyCpu ), placing the header image at the top of the body image.我将标题和正文图像组合在一起(使用Magick.NET-Q16-AnyCpu ),将标题图像放在正文图像的顶部。

An alternative solution is available if you have access to Office 365. It has fewer limitations than my previous answer, but it requires that purchase.

I get a Graph API token, the site I want to work with, and the drive I want to use.
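The answer does not show how the token is obtained. One possible way, sketched here under the assumption that the app is registered in Azure AD and uses the OAuth 2.0 client credentials flow, is to request the token from the Microsoft identity platform and attach it to the `HttpClient` used for the Graph calls. The class name `GraphAuth` and the `tenantId`, `clientId` and `clientSecret` parameters are placeholders, not part of the original answer:

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;

public static class GraphAuth
{
    // Builds the form fields for an OAuth 2.0 client credentials token request.
    public static Dictionary<string, string> BuildTokenRequest(string clientId, string clientSecret) =>
        new Dictionary<string, string>
        {
            ["grant_type"] = "client_credentials",
            ["client_id"] = clientId,
            ["client_secret"] = clientSecret,
            ["scope"] = "https://graph.microsoft.com/.default"
        };

    // Requests a token and returns an HttpClient that sends it as a bearer token.
    public static async Task<HttpClient> CreateGraphClientAsync(
        string tenantId, string clientId, string clientSecret)
    {
        var client = new HttpClient();
        using var response = await client.PostAsync(
            $"https://login.microsoftonline.com/{tenantId}/oauth2/v2.0/token",
            new FormUrlEncodedContent(BuildTokenRequest(clientId, clientSecret)));
        response.EnsureSuccessStatusCode();

        // The token endpoint answers with JSON that carries the token in "access_token".
        using var json = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        var token = json.RootElement.GetProperty("access_token").GetString();

        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token);
        return client;
    }
}
```

The returned client can be passed as the `client` argument of the upload and download methods in this answer. The app registration also needs Graph application permissions that allow reading and writing files on the target site.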

After that I read the docx into a stream:

    public static async Task<MemoryStream> GetByteArrayOfDocumentAsync(string baseFilePathLocation)
    {
        // Read the file without blocking the calling thread.
        var byteArray = await File.ReadAllBytesAsync(baseFilePathLocation);

        // Do not wrap this in `using`: the caller owns the stream and must be able
        // to read it after this method returns.
        return new MemoryStream(byteArray);
    }

This stream is then uploaded to the Graph API, using a client set up with our Graph API token:

    public static async Task<string> UploadFileAsync(HttpClient client,
                                                     string siteId,
                                                     MemoryStream stream,
                                                     string driveId,
                                                     string fileName,
                                                     string folderName = "root")
    {
        // PUT the raw bytes to the item path to upload the file.
        var result = await client.PutAsync(
            $"https://graph.microsoft.com/v1.0/sites/{siteId}/drives/{driveId}/items/{folderName}:/{fileName}:/content",
            new ByteArrayContent(stream.ToArray()));
        result.EnsureSuccessStatusCode();

        // SharepointDocument is a DTO (not shown) whose `id` property maps to the returned item id.
        var res = JsonSerializer.Deserialize<SharepointDocument>(await result.Content.ReadAsStringAsync());
        return res.id;
    }

We then download the same item from the Graph API, asking for it in PDF format:

    public static async Task<Stream> GetPdfOfDocumentAsync(HttpClient client,
                                                           string siteId,
                                                           string driveId,
                                                           string documentId)
    {
        // `?format=pdf` asks Graph to convert the stored document before returning it.
        var response =
            await client.GetAsync(
                $"https://graph.microsoft.com/v1.0/sites/{siteId}/drives/{driveId}/items/{documentId}/content?format=pdf");
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStreamAsync();
    }

This returns a stream containing the PDF version of the document that was just uploaded.

If you are able to use a containerized solution (Docker), there is a very good project out there:

Project Gotenberg

https://gotenberg.dev/

I have tried it before. It uses LibreOffice for docx-to-PDF conversion, but it has many more features. It is also a stateless, dockerized API, so it is self-sufficient.
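Calling Gotenberg from .NET Core could look roughly like the sketch below. It assumes a Gotenberg 7 or later container is already running and reachable (for example at `http://localhost:3000`), and that its LibreOffice route `/forms/libreoffice/convert` accepts the document as a multipart form field named `files`. The class name `GotenbergClient` and the method names are hypothetical:

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class GotenbergClient
{
    // Wraps the document bytes in a multipart form with a single "files" part.
    // The file name keeps its extension so the server can tell which format it received.
    public static MultipartFormDataContent BuildForm(byte[] document, string fileName)
    {
        var form = new MultipartFormDataContent();
        form.Add(new ByteArrayContent(document), "files", fileName);
        return form;
    }

    // Posts the document to the LibreOffice conversion route and returns the PDF bytes.
    public static async Task<byte[]> ConvertToPdfAsync(
        HttpClient client, string gotenbergBaseUrl, string documentPath)
    {
        var bytes = await File.ReadAllBytesAsync(documentPath);
        using var form = BuildForm(bytes, Path.GetFileName(documentPath));
        using var response = await client.PostAsync(
            $"{gotenbergBaseUrl}/forms/libreoffice/convert", form);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsByteArrayAsync();
    }
}
```

Because the conversion happens inside the container, the .NET application needs no Office or LibreOffice installation of its own, only HTTP access to the Gotenberg service.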

I know this can be done using Microsoft.Office.Interop.Word, but my application is .NET Core and does not have access to Office interop.

Maybe this is not true? You CAN load assemblies in .NET Core; however, loading interop components may be a challenge, since .NET Core is host agnostic.

Here is the thing, though: you don't need to install Office to obtain the Office primary interop assemblies. You can try loading the assemblies without using COM+, though this may be a bit tricky. I'm actually not sure whether this can be done, but I think in theory you should be able to do it. Has anyone tried this without installing Office?

Here is the link to the Office PIA: https://www.microsoft.com/en-us/download/confirmation.aspx?id=3508

