简体   繁体   中英

Export docx/doc First Header and Footer as docx File Using openXML

I want to ask how can i convert Header/Footer part of MS Word Document (doc/docx) to HTML. I'm opening the Document like:

using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true))

aka OpenXML

I'm converting the Document with WmlToHtmlConverter which converts the document excellent except that the headers and footers are skipt, cuz html standart doesnt support pagination. I was wondering how can i get them and extract them as html. I'm trying by getting them like :

using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(mainFileMemoryStream, true))
{
    Document mainPart = wdDoc.MainDocumentPart.Document;
    DocumentFormat.OpenXml.Packaging.HeaderPart firstHeader =
            wdDoc.MainDocumentPart.HeaderParts.FirstOrDefault();

    if (firstHeader != null)
    {
        using (var headerStream = firstHeader.GetStream())
        {
            return headerStream.ReadFully();
        }
    }
    return null;
}

and then passing it to the Convertion Function, but i get exception which says:

File Contains Corrupted Data, with stack trace:

at System.IO.Packaging.ZipPackage..ctor(Stream s, FileMode packageFileMode, FileAccess packageFileAccess)
at System.IO.Packaging.Package.Open(Stream stream, FileMode packageMode, FileAccess packageAccess)
at DocumentFormat.OpenXml.Packaging.OpenXmlPackage.OpenCore(Stream stream, Boolean readWriteMode)
at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable, OpenSettings openSettings)
at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable)
at DocxToHTML.Converter.HTMLConverter.ParseDOCX(Byte[] fileInfo, String fileName) in D:\eTemida\eTemida.Web\DocxToHTML.Converter\HTMLConverter.cs:line 96

Any Help will be appreciated

a lot of struggle led me to the following solution:

I Created a function for converting byte array of docx Document to Html As Follows

public string ConvertToHtml(byte[] fileInfo, string fileName = "Default.docx")
    {
        if (string.IsNullOrEmpty(fileName) || Path.GetExtension(fileName) != ".docx")
            return "Unsupported format";

        //FileInfo fileInfo = new FileInfo(fullFilePath);

        string htmlText = string.Empty;
        try
        {
            htmlText = ParseDOCX(fileInfo, fileName);
        }
        catch (OpenXmlPackageException e)
        {

            if (e.ToString().Contains("Invalid Hyperlink"))
            {
                using (MemoryStream fs = new MemoryStream(fileInfo))
                {
                    UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
                }
                htmlText = ParseDOCX(fileInfo, fileName);
            }
        }
        return htmlText;
    }

Where the ParseDOCX does all the convertion. The code of ParseDOCX :

private string ParseDOCX(byte[] fileInfo, string fileName)
    {
        try
        {
            //byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
            using (MemoryStream memoryStream = new MemoryStream())
            {
                memoryStream.Write(fileInfo, 0, fileInfo.Length);

                using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true))
                {

                    int imageCounter = 0;

                    var pageTitle = fileName;
                    var part = wDoc.CoreFilePropertiesPart;
                    if (part != null)
                        pageTitle = (string)part.GetXDocument().Descendants(DC.title).FirstOrDefault() ?? fileName;

                    WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                    {
                        AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                        PageTitle = pageTitle,
                        FabricateCssClasses = true,
                        CssClassPrefix = "pt-",
                        RestrictToSupportedLanguages = false,
                        RestrictToSupportedNumberingFormats = false,
                        ImageHandler = imageInfo =>
                        {
                            ++imageCounter;
                            string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                            ImageFormat imageFormat = null;
                            if (extension == "png") imageFormat = ImageFormat.Png;
                            else if (extension == "gif") imageFormat = ImageFormat.Gif;
                            else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                            else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                            else if (extension == "tiff")
                            {
                                extension = "gif";
                                imageFormat = ImageFormat.Gif;
                            }
                            else if (extension == "x-wmf")
                            {
                                extension = "wmf";
                                imageFormat = ImageFormat.Wmf;
                            }

                            if (imageFormat == null)
                                return null;

                            string base64 = null;
                            try
                            {
                                using (MemoryStream ms = new MemoryStream())
                                {
                                    imageInfo.Bitmap.Save(ms, imageFormat);
                                    var ba = ms.ToArray();
                                    base64 = System.Convert.ToBase64String(ba);
                                }
                            }
                            catch (System.Runtime.InteropServices.ExternalException)
                            { return null; }


                            ImageFormat format = imageInfo.Bitmap.RawFormat;
                            ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders().First(c => c.FormatID == format.Guid);
                            string mimeType = codec.MimeType;

                            string imageSource = string.Format("data:{0};base64,{1}", mimeType, base64);

                            XElement img = new XElement(Xhtml.img,
                                new XAttribute(NoNamespace.src, imageSource),
                                imageInfo.ImgStyleAttribute,
                                imageInfo.AltText != null ?
                                    new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                            return img;
                        }

                    };
                    XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);

                    var html = new XDocument(new XDocumentType("html", null, null, null), htmlElement);
                    var htmlString = html.ToString(SaveOptions.DisableFormatting);
                    return htmlString;
                }
            }
        }
        catch (Exception)
        {
            return "File contains corrupt data";
        }
    }

So far everything looked nice and easy but then i realized that the Header and the Footer of the Document are just skipt, so i had to somehow convert them. I tried to use the GetStream() Method of HeaderPart, but of course exception was throw, cuz the Header tree is not the same as the one of the Document.

Then i decided to extract the Header and Footer as new documents (having hard time with this) with openXML's WordprocessingDocument headerDoc = WordprocessingDocument.Create(headerStream,Document) but unfortunaly the convertion of this document was also unsuccsesful as you might thing, because this is just creating a plain docx document without any settings,styles,webSettings etc. . This took a lot of time to figute out.

SO finaly i decided to Create a new Document Via Cathal's DocX Library and it finaly came to live. The Code is as follows :

public string ConvertHeaderToHtml(HeaderPart header)
    {

        using (MemoryStream headerStream = new MemoryStream())
        {
            //Cathal's Docx Create
            var newDocument = Novacode.DocX.Create(headerStream);
            newDocument.Save();

            using (WordprocessingDocument headerDoc = WordprocessingDocument.Open(headerStream,true))
            {
                var headerParagraphs = new List<OpenXmlElement>(header.Header.Elements());
                var mainPart = headerDoc.MainDocumentPart;

                //Cloning the List is necesery because it will throw exception for the reason
                // that you are working with refferences of the Elements
                mainPart.Document.Body.Append(headerParagraphs.Select(h => (OpenXmlElement)h.Clone()).ToList());

                //Copies the Header RelationShips as Document's
                foreach (IdPartPair parts in header.Parts)
                {
                    //Very important second parameter of AddPart, if not set the relationship ID is being changed
                    // and the wordDocument pictures, etc. wont show
                    mainPart.AddPart(parts.OpenXmlPart,parts.RelationshipId);
                }
                headerDoc.MainDocumentPart.Document.Save();
                headerDoc.Save();
                headerDoc.Close();
            }
            return ConvertToHtml(headerStream.ToArray());
        }
    }

So that was with the Header. I'm passing the HeaderPart and getting its Header then Elements. Extracting the relationships, which is very important if you have images in the header, and importing them in the Document itself And the Document is Ready for convertion.

The same steps are used to Generate the Html out of the Footer.

Hope This will help some in his Duty.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM