Convert Docx to html using OpenXml power tools without formatting

I'm using OpenXml Power Tools in my project to convert a document (docx) into HTML. The sample code provided with the SDK produces an elegant duplicate in HTML form. (GitHub link: https://github.com/OfficeDev/Open-Xml-PowerTools/blob/vNext/OpenXmlPowerToolsExamples/HtmlConverter01/HtmlConverter01.cs )

However, looking at the HTML markup, the HTML has embedded styling.

Is there any way of turning this off and using plain and simple <h1> and <p> tags?

I would like to know how to turn off this embedded styling, as the formatting would be taken care of by Bootstrap.

The embedded styling is as follows:

 <p dir="ltr" style="font-family: Calibri;font-size: 11pt;line-height: 115.0%;margin-bottom: 0;margin-left: 0;margin-right: 0;margin-top: 0;">
 <span xml:space="preserve" style="font-size: 11pt;font-style: normal;font-weight: normal;margin: 0;padding: 0;"> </span>
 </p>

As you can see, this is fine if you want a direct copy, but not if you want to control the style yourself.

In the C# code I have already made the following adjustments:

  • AdditionalCss is commented out
  • FabricateCssClasses is false
  • CssClassPrefix is commented out

Many thanks.

You can also use XmlReader and XmlWriter to obtain bare-bones HTML. This could, however, be a little overkill, as only the tags themselves and their text content will be kept.

using System.IO;
using System.Text;
using System.Xml;

public static class HtmlHelper
{
    /// <summary>
    /// Keep only the opening and closing tags, and text content from the html
    /// </summary>
    public static string CleanUp(string html)
    {
        var output = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(html)))
        {
            var settings = new XmlWriterSettings() { Indent = true, OmitXmlDeclaration = true };
            using (var writer = XmlWriter.Create(output, settings))
            {
                while (reader.Read())
                {
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            // Attributes are never copied, so all styling is dropped
                            writer.WriteStartElement(reader.Name);
                            // Self-closing elements (e.g. <br />) raise no EndElement node,
                            // so they have to be closed here
                            if (reader.IsEmptyElement)
                            {
                                writer.WriteEndElement();
                            }
                            break;
                        case XmlNodeType.Text:
                            writer.WriteString(reader.Value);
                            break;
                        case XmlNodeType.EndElement:
                            writer.WriteFullEndElement();
                            break;
                    }
                }
            }
        }

        return output.ToString();
    }
}

Resulting output:

<p>
  <span></span>
</p>
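
If a few attributes need to survive (for example src on images), the same reader/writer loop can copy a small whitelist across. This is only a sketch: the class name HtmlWhitelistHelper and the keep parameter are made up for illustration, and it assumes the input is well-formed XML, just as CleanUp above does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class HtmlWhitelistHelper
{
    // Hypothetical variant of CleanUp: drops every attribute except those named in 'keep'.
    public static string CleanUp(string html, ISet<string> keep)
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
        using (var reader = XmlReader.Create(new StringReader(html)))
        using (var writer = XmlWriter.Create(output, settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        // Must be read before moving onto the attributes
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.LocalName);
                        while (reader.MoveToNextAttribute())
                        {
                            // Only unprefixed attributes are considered; xmlns and xml:* are skipped
                            if (reader.NamespaceURI.Length == 0 && keep.Contains(reader.LocalName))
                            {
                                writer.WriteAttributeString(reader.LocalName, reader.Value);
                            }
                        }
                        if (isEmpty)
                        {
                            writer.WriteEndElement();
                        }
                        break;
                    }
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                }
            }
        }

        return output.ToString();
    }
}
```

It could be called as HtmlWhitelistHelper.CleanUp(html, new HashSet<string> { "src", "alt", "href" }), which keeps images and links usable while still dropping style, dir and the rest.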

I have solved this with a hint from Xiaoy312.

Using the sample linked above, the resulting HTML string can be loaded into the Html Agility Pack, like so:

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlString);

Then look for the attributes (style and any others) and remove them.

var styles = htmlDoc.DocumentNode.SelectNodes("//@style");
if (styles != null)
{
    foreach (var item in styles)
    {
        item.Attributes["style"].Remove();
    }
}

And then save the file.

var fileName = Path.Combine(outputDirectory, "index.html");
// Saving by path avoids leaving an undisposed FileStream open
htmlDoc.Save(fileName);

There will be other ways of doing this, but this seems like an acceptable workaround.

EDIT:

After some experimenting with both answers posted here, I found this implementation to work best, as it does not have an issue with images.

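var body = htmlDoc.DocumentNode.SelectSingleNode("//body");
// ".//*" limits the search to descendants of body; "//*" would match the whole document
var tags = body?.SelectNodes(".//*");
if (tags != null)
{
    foreach (var tag in tags)
    {
        // Skips any element whose markup mentions "img": the image itself and its ancestors
        if (!tag.OuterHtml.Contains("img"))
        {
            tag.Attributes.RemoveAll();
        }
    }
}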

In theory you can also use this for tables; however, depending on the styling you want, you can always strip out the attributes generated by Power Tools and replace them with your own.
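
As a rough illustration of the "strip and replace" idea, here is a sketch using LINQ to XML instead of the Html Agility Pack. It assumes the markup is well-formed XML (the linked sample appears to work with an XElement before serializing it). The AttributeRewriter class, the Restyle method and the tableClass parameter are hypothetical names, not part of Power Tools.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class AttributeRewriter
{
    // Hypothetical helper: strips attributes everywhere except on <img>,
    // then gives every <table> a class chosen by the caller.
    public static XElement Restyle(XElement root, string tableClass)
    {
        foreach (var element in root.DescendantsAndSelf())
        {
            // Leave images untouched so src, alt, etc. survive
            if (element.Name.LocalName == "img")
            {
                continue;
            }

            // Keep namespace declarations; remove everything else
            element.Attributes().Where(a => !a.IsNamespaceDeclaration).Remove();

            if (element.Name.LocalName == "table")
            {
                element.SetAttributeValue("class", tableClass);
            }
        }

        return root;
    }
}
```

For example, AttributeRewriter.Restyle(XElement.Parse(htmlString), "table") would leave each table with only class="table", which Bootstrap can then style.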
