Export the first header and footer of a docx/doc file as a docx file using OpenXML

Question

How can I convert the header/footer part of an MS Word document (doc/docx) to HTML? I'm opening the document like this:

using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true))

i.e. with OpenXML.

I'm converting the document with WmlToHtmlConverter, which converts the document very well except that the headers and footers are skipped, because the HTML standard doesn't support pagination. I was wondering how I can get them and extract them as HTML. I'm trying to get them like this:

using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(mainFileMemoryStream, true))
{
    Document mainPart = wdDoc.MainDocumentPart.Document;
    DocumentFormat.OpenXml.Packaging.HeaderPart firstHeader =
            wdDoc.MainDocumentPart.HeaderParts.FirstOrDefault();

    if (firstHeader != null)
    {
        using (var headerStream = firstHeader.GetStream())
        {
            return headerStream.ReadFully();
        }
    }
    return null;
}

and then passing the result to the conversion function, but I get an exception that says:

File Contains Corrupted Data, with this stack trace:

at System.IO.Packaging.ZipPackage..ctor(Stream s, FileMode packageFileMode, FileAccess packageFileAccess)
at System.IO.Packaging.Package.Open(Stream stream, FileMode packageMode, FileAccess packageAccess)
at DocumentFormat.OpenXml.Packaging.OpenXmlPackage.OpenCore(Stream stream, Boolean readWriteMode)
at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable, OpenSettings openSettings)
at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable)
at DocxToHTML.Converter.HTMLConverter.ParseDOCX(Byte[] fileInfo, String fileName) in D:\eTemida\eTemida.Web\DocxToHTML.Converter\HTMLConverter.cs:line 96

Any help will be appreciated.

Answer 1

After a lot of struggle I arrived at the following solution.

I created a function that converts the byte array of a docx document to HTML, as follows:

public string ConvertToHtml(byte[] fileInfo, string fileName = "Default.docx")
    {
        if (string.IsNullOrEmpty(fileName) || Path.GetExtension(fileName) != ".docx")
            return "Unsupported format";

        //FileInfo fileInfo = new FileInfo(fullFilePath);

        string htmlText = string.Empty;
        try
        {
            htmlText = ParseDOCX(fileInfo, fileName);
        }
        catch (OpenXmlPackageException e)
        {

            if (e.ToString().Contains("Invalid Hyperlink"))
            {
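                // Copy into an expandable stream: a MemoryStream wrapped around a
                // byte[] cannot grow, and the repaired package may change size.
                using (MemoryStream fs = new MemoryStream())
                {
                    fs.Write(fileInfo, 0, fileInfo.Length);
                    UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
                    fileInfo = fs.ToArray();
                }
                htmlText = ParseDOCX(fileInfo, fileName);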
            }
        }
        return htmlText;
    }
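The FixUri helper called above is not shown in this answer. A minimal sketch of what it might look like, assuming UriFixer.FixInvalidUri expects a handler that takes the broken URI string and returns a Uri (the class name UriRepair and the placeholder address are made up for illustration):

```csharp
using System;

public static class UriRepair
{
    // Hypothetical stand-in for the FixUri helper that the answer calls
    // but does not show. It replaces every broken hyperlink target with
    // a fixed placeholder so that the package can be opened again.
    public static Uri FixUri(string brokenUri)
    {
        // The original (invalid) target is discarded here; log it first
        // if you need to know which links were replaced.
        return new Uri("http://broken-link/");
    }
}
```

This loses the original link target, which is usually acceptable when the goal is only to get the document to open for conversion.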

ParseDOCX does the actual conversion. The code of ParseDOCX:

private string ParseDOCX(byte[] fileInfo, string fileName)
    {
        try
        {
            //byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
            using (MemoryStream memoryStream = new MemoryStream())
            {
                memoryStream.Write(fileInfo, 0, fileInfo.Length);

                using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true))
                {

                    int imageCounter = 0;

                    var pageTitle = fileName;
                    var part = wDoc.CoreFilePropertiesPart;
                    if (part != null)
                        pageTitle = (string)part.GetXDocument().Descendants(DC.title).FirstOrDefault() ?? fileName;

                    WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                    {
                        AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                        PageTitle = pageTitle,
                        FabricateCssClasses = true,
                        CssClassPrefix = "pt-",
                        RestrictToSupportedLanguages = false,
                        RestrictToSupportedNumberingFormats = false,
                        ImageHandler = imageInfo =>
                        {
                            ++imageCounter;
                            string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                            ImageFormat imageFormat = null;
                            if (extension == "png") imageFormat = ImageFormat.Png;
                            else if (extension == "gif") imageFormat = ImageFormat.Gif;
                            else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                            else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                            else if (extension == "tiff")
                            {
                                extension = "gif";
                                imageFormat = ImageFormat.Gif;
                            }
                            else if (extension == "x-wmf")
                            {
                                extension = "wmf";
                                imageFormat = ImageFormat.Wmf;
                            }

                            if (imageFormat == null)
                                return null;

                            string base64 = null;
                            try
                            {
                                using (MemoryStream ms = new MemoryStream())
                                {
                                    imageInfo.Bitmap.Save(ms, imageFormat);
                                    var ba = ms.ToArray();
                                    base64 = System.Convert.ToBase64String(ba);
                                }
                            }
                            catch (System.Runtime.InteropServices.ExternalException)
                            { return null; }


                            ImageFormat format = imageInfo.Bitmap.RawFormat;
                            ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders().First(c => c.FormatID == format.Guid);
                            string mimeType = codec.MimeType;

                            string imageSource = string.Format("data:{0};base64,{1}", mimeType, base64);

                            XElement img = new XElement(Xhtml.img,
                                new XAttribute(NoNamespace.src, imageSource),
                                imageInfo.ImgStyleAttribute,
                                imageInfo.AltText != null ?
                                    new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                            return img;
                        }

                    };
                    XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);

                    var html = new XDocument(new XDocumentType("html", null, null, null), htmlElement);
                    var htmlString = html.ToString(SaveOptions.DisableFormatting);
                    return htmlString;
                }
            }
        }
        catch (Exception)
        {
            return "File contains corrupt data";
        }
    }

So far everything looked nice and easy, but then I realized that the header and the footer of the document are simply skipped, so I had to convert them somehow. I tried to use the GetStream() method of HeaderPart, but of course an exception was thrown, because the header tree is not the same as the document's: the stream holds only the header's XML, not a complete docx package.
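To see why the header stream is rejected with "File contains corrupted data", it can help to check whether the bytes even look like a ZIP package before handing them to WordprocessingDocument.Open. The following is only a sketch: PackageSniffer and LooksLikeZipPackage are names invented here, and ReadFully is a guess at what the extension method used in the question does.

```csharp
using System;
using System.IO;

public static class PackageSniffer
{
    // Reads the remaining bytes of a stream into an array
    // (a guess at what the question's ReadFully() extension does).
    public static byte[] ReadFully(this Stream input)
    {
        using (var buffer = new MemoryStream())
        {
            input.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // A .docx file is a non-empty ZIP package, so it starts with the
    // local file header signature "PK\x03\x04".
    public static bool LooksLikeZipPackage(byte[] data)
    {
        return data != null && data.Length >= 4
            && data[0] == 0x50 && data[1] == 0x4B
            && data[2] == 0x03 && data[3] == 0x04;
    }
}
```

The bytes returned by HeaderPart.GetStream() are plain XML, so this check fails for them, while it passes for the bytes of the original docx file.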

Then I decided to extract the header and footer as new documents (I had a hard time with this) using OpenXML's WordprocessingDocument headerDoc = WordprocessingDocument.Create(headerStream, Document), but unfortunately, as you might expect, converting this document was also unsuccessful, because this just creates a plain docx document without any settings, styles, webSettings, etc. This took a lot of time to figure out.

So finally I decided to create the new document via Cathal's DocX library, and it finally came to life. The code is as follows:

public string ConvertHeaderToHtml(HeaderPart header)
    {

        using (MemoryStream headerStream = new MemoryStream())
        {
            //Cathal's Docx Create
            var newDocument = Novacode.DocX.Create(headerStream);
            newDocument.Save();

            using (WordprocessingDocument headerDoc = WordprocessingDocument.Open(headerStream,true))
            {
                var headerParagraphs = new List<OpenXmlElement>(header.Header.Elements());
                var mainPart = headerDoc.MainDocumentPart;

                // Cloning the elements is necessary: the originals still belong to the
                // header tree, and appending them directly throws an exception.
                mainPart.Document.Body.Append(headerParagraphs.Select(h => (OpenXmlElement)h.Clone()).ToList());

                //Copies the Header RelationShips as Document's
                foreach (IdPartPair parts in header.Parts)
                {
                    // The second parameter of AddPart is very important: without it the
                    // relationship ID changes and pictures etc. in the document won't show.
                    mainPart.AddPart(parts.OpenXmlPart,parts.RelationshipId);
                }
                headerDoc.MainDocumentPart.Document.Save();
                headerDoc.Save();
                headerDoc.Close();
            }
            return ConvertToHtml(headerStream.ToArray());
        }
    }

So that was the header. I pass in the HeaderPart and get its Header, then its Elements. I extract the relationships, which is very important if you have images in the header, and import them into the new document itself; the document is then ready for conversion.

The same steps are used to generate the HTML from the footer.

Hope this helps someone with their task.
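As a sketch of those same steps for the footer, the variant below mirrors ConvertHeaderToHtml but returns the standalone docx bytes instead of HTML, so the caller can either save them as a file or pass them to ConvertToHtml. The class and method names (FooterExport, FooterToDocx) are invented for this example, and it relies on the `using` block to save the document rather than calling Close().

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;

public static class FooterExport
{
    // Hypothetical footer counterpart of ConvertHeaderToHtml.
    public static byte[] FooterToDocx(FooterPart footer)
    {
        using (MemoryStream footerStream = new MemoryStream())
        {
            // Same base document as in the header version (Cathal's DocX library).
            var newDocument = Novacode.DocX.Create(footerStream);
            newDocument.Save();

            using (WordprocessingDocument footerDoc =
                       WordprocessingDocument.Open(footerStream, true))
            {
                MainDocumentPart mainPart = footerDoc.MainDocumentPart;

                // Clone each element, because the originals still belong to the footer tree.
                List<OpenXmlElement> clones = footer.Footer.Elements()
                    .Select(e => (OpenXmlElement)e.Clone())
                    .ToList();
                mainPart.Document.Body.Append(clones);

                // Keep the original relationship IDs so image references still resolve.
                foreach (IdPartPair pair in footer.Parts)
                {
                    mainPart.AddPart(pair.OpenXmlPart, pair.RelationshipId);
                }
                mainPart.Document.Save();
            }
            return footerStream.ToArray();
        }
    }
}
```

A call might look like `ConvertToHtml(FooterExport.FooterToDocx(wdDoc.MainDocumentPart.FooterParts.First()))`, assuming the document has at least one footer part.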

Answer 1
2 2017-10-11 11:37:40