Generating docx file from HTML file using OpenXML

I'm using this method to generate a docx file:

public static void CreateDocument(string documentFileName, string text)
{
    using (WordprocessingDocument wordDoc =
        WordprocessingDocument.Create(documentFileName, WordprocessingDocumentType.Document))
    {
        MainDocumentPart mainPart = wordDoc.AddMainDocumentPart();

        string docXml =
                    @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?>
                 <w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
                 <w:body><w:p><w:r><w:t>#REPLACE#</w:t></w:r></w:p></w:body>
                 </w:document>";

        docXml = docXml.Replace("#REPLACE#", text);

        using (Stream stream = mainPart.GetStream())
        {
            byte[] buf = (new UTF8Encoding()).GetBytes(docXml);
            stream.Write(buf, 0, buf.Length);
        }
    }
}

It works like a charm:

CreateDocument("test.docx", "Hello");

But what if I want to put HTML content instead of Hello? For example:

CreateDocument("test.docx", @"<html><head></head>
                              <body>
                                    <h1>Hello</h1>
                              </body>
                        </html>");

Or something like this:

CreateDocument("test.docx", @"Hello<BR>
                                    This is a simple text<BR>
                                    Third paragraph<BR>
                                    Sign
                        ");

Both cases create an invalid structure for document.xml. Any idea? How can I generate a docx file from HTML content?

I realize I'm 7 years late to the game here. Still, for future people searching for how to convert from HTML to a Word doc, this blog posting on a Microsoft MSDN site gives most of the ingredients necessary to do this using OpenXML. I found the post itself confusing, but the source code that he included clarified it all for me.

The only piece that was missing was how to build a docx file from scratch, instead of merging into an existing one as his example shows. I found that tidbit here.

Unfortunately, the project I used this in is written in vb.net. So I'm going to share the vb.net code first, then an automated c# conversion of it, which may or may not be accurate.

vb.net code:

Imports DocumentFormat.OpenXml
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing
Imports System.IO

Dim ms As IO.MemoryStream
Dim mainPart As MainDocumentPart
Dim b As Body
Dim d As Document
Dim chunk As AlternativeFormatImportPart
Dim altChunk As AltChunk

Const altChunkID As String = "AltChunkId1"

ms = New MemoryStream()

Using myDoc = WordprocessingDocument.Create(ms,WordprocessingDocumentType.Document)
    mainPart = myDoc.MainDocumentPart

    If mainPart Is Nothing Then
        mainPart = myDoc.AddMainDocumentPart()

        b = New Body()
        d = New Document(b)
        d.Save(mainPart)
    End If

    chunk = mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Xhtml, altChunkID)

    Using chunkStream As Stream = chunk.GetStream(FileMode.Create, FileAccess.Write)
        Using stringStream As StreamWriter = New StreamWriter(chunkStream)
            stringStream.Write("YOUR HTML HERE")
        End Using
    End Using

    altChunk = New AltChunk()
    altChunk.Id = altChunkID
    mainPart.Document.Body.InsertAt(Of AltChunk)(altChunk, 0)
    mainPart.Document.Save()
End Using

c# code:

using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System.IO;

MemoryStream ms;
MainDocumentPart mainPart;
Body b;
Document d;
AlternativeFormatImportPart chunk;
AltChunk altChunk;

const string altChunkID = "AltChunkId1";

ms = new MemoryStream();

using (WordprocessingDocument myDoc = WordprocessingDocument.Create(ms, WordprocessingDocumentType.Document))
{
    mainPart = myDoc.MainDocumentPart;

    // A newly created package has no main part yet, so add one with an empty body.
    if (mainPart == null)
    {
        mainPart = myDoc.AddMainDocumentPart();
        b = new Body();
        d = new Document(b);
        d.Save(mainPart);
    }

    // Store the HTML in its own part inside the package.
    chunk = mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Xhtml, altChunkID);

    using (Stream chunkStream = chunk.GetStream(FileMode.Create, FileAccess.Write))
    {
        using (StreamWriter stringStream = new StreamWriter(chunkStream))
        {
            stringStream.Write("YOUR HTML HERE");
        }
    }

    // Reference the HTML part from the document body through its relationship ID.
    altChunk = new AltChunk();
    altChunk.Id = altChunkID;
    mainPart.Document.Body.InsertAt(altChunk, 0);
    mainPart.Document.Save();
}

Note that I'm using the ms memory stream in another routine, which is where it's disposed of after use.

I hope this helps someone else!

You cannot just insert HTML content into "document.xml"; this part expects only WordprocessingML content, so you'll have to convert that HTML into WordprocessingML, see this.
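To illustrate what "converting to WordprocessingML" means in the simplest case, here is a minimal sketch that handles only the `<BR>`-separated text from the question. It assumes nothing but System.Xml.Linq; the class name `SimpleWordML` and the method `BuildDocumentXml` are hypothetical names made up for this example, and real HTML (headings, nested tags, styles) would need a proper converter instead.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SimpleWordML
{
    // WordprocessingML main namespace, the same URI used in the question's template.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: splits the input on <BR> and emits one <w:p> per line.
    // Only <BR> is understood; any other markup ends up as escaped literal text.
    public static string BuildDocumentXml(string text)
    {
        var paragraphs = text
            .Split(new[] { "<BR>", "<br>" }, StringSplitOptions.None)
            .Select(line => line.Trim())
            .Select(line => new XElement(W + "p",
                new XElement(W + "r",
                    // XElement escapes &, < and > when the tree is serialized.
                    new XElement(W + "t", line))));

        var document = new XDocument(
            new XDeclaration("1.0", "UTF-8", "yes"),
            new XElement(W + "document",
                new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName),
                new XElement(W + "body", paragraphs)));

        // XDocument.ToString() omits the XML declaration, so prepend it explicitly.
        return document.Declaration + document.ToString(SaveOptions.DisableFormatting);
    }
}
```

The returned string could then be written to mainPart.GetStream() in the same way as docXml in the question's CreateDocument method, in place of the #REPLACE# substitution.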

Another thing you could use is the altChunk element. With it you can place an HTML file inside your DOCX file and then reference that HTML content at a specific place inside your document, see this.

Lastly, as an alternative, with the GemBox.Document library you could accomplish exactly what you want, see the following:

public static void CreateDocument(string documentFileName, string text)
{
    DocumentModel document = new DocumentModel();
    document.Content.LoadText(text, LoadOptions.HtmlDefault);
    document.Save(documentFileName);
}

Or you could convert the HTML content directly into a DOCX file:

public static void Convert(string documentFileName, string htmlText)
{
    HtmlLoadOptions options = LoadOptions.HtmlDefault;
    using (var htmlStream = new MemoryStream(options.Encoding.GetBytes(htmlText)))
        DocumentModel.Load(htmlStream, options)
                     .Save(documentFileName);
}

I could successfully convert HTML content to a docx file using OpenXML in .NET Core with this code:

string html = "<strong>Hello</strong> World";
using (MemoryStream generatedDocument = new MemoryStream())
{
    using (WordprocessingDocument package =
               WordprocessingDocument.Create(generatedDocument,
                   WordprocessingDocumentType.Document))
    {
        MainDocumentPart mainPart = package.MainDocumentPart;
        if (mainPart == null)
        {
            mainPart = package.AddMainDocumentPart();
            new Document(new Body()).Save(mainPart);
        }

        // HtmlConverter is not part of the Open XML SDK; it presumably comes from the HtmlToOpenXml package.
        HtmlConverter converter = new HtmlConverter(mainPart);
        converter.ParseHtml(html);
        mainPart.Document.Save();
    }

    // The package is closed here; use generatedDocument (see the snippets below) before this block ends.
}

To save on disk:

System.IO.File.WriteAllBytes("filename.docx", generatedDocument.ToArray());

To return the file for download in .NET Core MVC, use:

return File(generatedDocument.ToArray(), 
          "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
          "filename.docx");
