
Convert Docx to html using OpenXml power tools without formatting

I'm using Open XML PowerTools in my project to convert a document (docx) into HTML. The example code provided with the SDK produces an elegant duplicate in HTML form (GitHub link: https://github.com/OfficeDev/Open-Xml-PowerTools/blob/vNext/OpenXmlPowerToolsExamples/HtmlConverter01/HtmlConverter01.cs ).

However, looking at the HTML markup, every element carries embedded styling.

Is there any way of turning this off and using plain and simple <h1> and <p> tags?

I would like to remove this embedded styling, as the formatting would be taken care of by Bootstrap.

The embedded styling is as follows:

 <p dir="ltr" style="font-family: Calibri;font-size: 11pt;line-height: 115.0%;margin-bottom: 0;margin-left: 0;margin-right: 0;margin-top: 0;">
 <span xml:space="preserve" style="font-size: 11pt;font-style: normal;font-weight: normal;margin: 0;padding: 0;"> </span>
 </p>

As you can see, this is fine if you want a direct copy, but not if you want to control the style yourself.

In the C# code I have already made the following adjustments:

  • AdditionalCss is commented out
  • FabricateCssClasses is false
  • CssClassPrefix is commented out

Many thanks.

You can also use XmlReader and XmlWriter to obtain bare-bones HTML. This could however be a little overkill, as only the tags themselves and their text content will be kept.

using System.IO;
using System.Text;
using System.Xml;

public static class HtmlHelper
{
    /// <summary>
    /// Keep only the opening and closing tags, and the text content, from the html
    /// </summary>
    public static string CleanUp(string html)
    {
        var output = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(html)))
        {
            var settings = new XmlWriterSettings() { Indent = true, OmitXmlDeclaration = true };
            using (var writer = XmlWriter.Create(output, settings))
            {
                while (reader.Read())
                {
                    switch (reader.NodeType)
                    {
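                        case XmlNodeType.Element:
                            writer.WriteStartElement(reader.Name);
                            // Self-closing elements (e.g. <br />) never produce an
                            // EndElement node, so close them here.
                            if (reader.IsEmptyElement)
                                writer.WriteEndElement();
                            break;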
                        case XmlNodeType.Text:
                            writer.WriteString(reader.Value);
                            break;
                        case XmlNodeType.EndElement:
                            writer.WriteFullEndElement();
                            break;
                    }
                }
            }
        }

        return output.ToString();
    }
}

Resulting output:

<p>
  <span></span>
</p>
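
A related sketch, assuming the HTML is well-formed XML (as it has to be for the XmlReader approach above): LINQ to XML can remove attributes in place while keeping a chosen few, which avoids rewriting the whole document. `AttributeStripper`, `StripAttributes` and the `keep` parameter are hypothetical names introduced here for illustration, not part of Open XML PowerTools.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of Open XML PowerTools.
public static class AttributeStripper
{
    /// <summary>
    /// Removes every attribute from root and its descendants, except
    /// namespace declarations and attributes whose local name is in keep.
    /// </summary>
    public static XElement StripAttributes(XElement root, params string[] keep)
    {
        // Take a snapshot with ToList() so attributes are not removed
        // while the tree is still being enumerated.
        var toRemove = root.DescendantsAndSelf()
            .Attributes()
            .Where(a => !a.IsNamespaceDeclaration && !keep.Contains(a.Name.LocalName))
            .ToList();

        foreach (var attribute in toRemove)
        {
            attribute.Remove();
        }

        return root;
    }
}
```

Calling `AttributeStripper.StripAttributes(htmlElement, "src", "alt")` on a parsed `XElement` would leave image sources intact and drop everything else, including `style`.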

I solved this with a hint from Xiaoy312.

Starting from the example above, the resulting HTML string can be loaded into the Html Agility Pack like so:

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlString);

Then look for the attributes (style and any others) and remove them.

var styles = htmlDoc.DocumentNode.SelectNodes("//@style");
if (styles != null)
{
    foreach (var item in styles)
    {
        item.Attributes["style"].Remove();
    }
}

Then save the file.

var fileName = Path.Combine(outputDirectory, "index.html");
// Save(string) opens and closes the file itself, so no FileStream is left undisposed.
htmlDoc.Save(fileName);

There will be other ways of doing this, but it seems like an acceptable workaround.

EDIT:

After some experimenting with both answers posted here, I found this implementation to work best, as it does not have an issue with images.

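var body = htmlDoc.DocumentNode.SelectSingleNode("//body");
// ".//*" is relative to body; a bare "//*" would search the whole document.
var tags = body?.SelectNodes(".//*");
if (tags != null)
{
    foreach (var tag in tags)
    {
        // Keep the attributes on <img> so that src survives.
        if (tag.Name != "img")
        {
            tag.Attributes.RemoveAll();
        }
    }
}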

In theory you can also use this for tables. Depending on the styling you want, you can always strip out the attributes generated by PowerTools and replace them with your own.
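
As a rough sketch of that last idea, assuming the document is already loaded into an Html Agility Pack `HtmlDocument` as above: the generated attributes on each table are removed and a class attribute is put in their place. `BootstrapStyler` and `StyleTables` are hypothetical names, and the `table table-striped` classes are just one Bootstrap choice picked for illustration.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, shown only to illustrate replacing attributes.
public static class BootstrapStyler
{
    public static void StyleTables(HtmlDocument htmlDoc)
    {
        var tables = htmlDoc.DocumentNode.SelectNodes("//table");
        if (tables == null)
        {
            return; // SelectNodes returns null when nothing matches
        }

        foreach (var table in tables)
        {
            // Drop whatever the converter generated on the <table> element...
            table.Attributes.RemoveAll();
            // ...and hand the styling over to Bootstrap (assumed class names).
            table.SetAttributeValue("class", "table table-striped");
        }
    }
}
```

This only touches the `<table>` elements themselves; rows and cells keep their attributes unless they are stripped separately.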

