Parse XHTML5 into XDocument

I need to parse XHTML5 files into XDocument instances. My files will always be well-formed XML, so I want to avoid HtmlAgilityPack because it tolerates malformed XHTML. The XDocument.Load method works for simple cases, but breaks when the document contains named character references (entities):

var xhtml = XDocument.Load(reader);
// XmlException: Reference to undeclared entity 'nbsp'. 

For XHTML 1.0, this issue could be resolved by using an XmlPreloadedResolver, which preloads the well-known DTDs that are defined in XHTML 1.0. The approach can be extended to support XHTML 1.1 by manually providing its DTD, as shown in this answer.
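For reference, the XHTML 1.0 setup described above might look like the following minimal sketch. The names Xhtml10Loader and Load are illustrative and do not come from the original post; the sketch assumes the input carries one of the standard XHTML 1.0 DOCTYPE declarations.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Resolvers;

static class Xhtml10Loader
{
    // Parses XHTML 1.0 text whose DOCTYPE references one of the W3C XHTML 1.0 DTDs.
    public static XDocument Load(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            // DTD processing is prohibited by default and must be enabled explicitly.
            DtdProcessing = DtdProcessing.Parse,
            // Serves the XHTML 1.0 DTDs and their entity sets from memory.
            XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10)
        };

        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            return XDocument.Load(reader);
        }
    }
}
```

This works only because the XHTML 1.0 DOCTYPE names a DTD that the resolver can supply; a bare XHTML5 DOCTYPE gives the resolver nothing to look up.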

However, XHTML5 does not have a DTD, as discussed under this other answer. Its entity definitions are provided for informational purposes as JSON. The document type declaration is simply:

<!DOCTYPE html>

Consequently, the XmlResolver methods are never called when parsing entities in XHTML5. There is a discussion of attempts to provide XmlReader with a list of entity declarations, but no approach seems to work out of the box.

Currently, there are two approaches I'm looking at. The first is specifying an internal subset with the entity declarations in the document type declaration, either through string manipulation on the source XHTML, or through XmlParserContext.InternalSubset. This would result in a document type declaration similar to:

<!DOCTYPE html [
  <!ENTITY ndash "&#8211;">
  <!ENTITY nbsp "&#160;">
  ...
]>

It seems like this is allowed in XHTML5; however, it is undesirable since it litters the XDocument with the entity declarations (of which there are now more than 2000), which will be problematic if the user converts it back to a string representation.
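A minimal sketch of the string-manipulation variant follows. The names Xhtml5SubsetLoader and EntityDeclarations are hypothetical, only two declarations are included, and the code naively assumes the source contains the literal text <!DOCTYPE html>. The last step is an assumption on my part rather than something from the linked discussions: since the reader expands the references while loading, clearing XDocumentType.InternalSubset afterwards may be enough to keep the declarations out of the serialized output.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class Xhtml5SubsetLoader
{
    // Illustrative subset; a full table would be generated from the HTML entity list.
    const string EntityDeclarations =
        "<!ENTITY ndash \"&#8211;\">" +
        "<!ENTITY nbsp \"&#160;\">";

    public static XDocument Load(string xhtml)
    {
        const string doctype = "<!DOCTYPE html>";
        int index = xhtml.IndexOf(doctype, StringComparison.Ordinal);
        if (index < 0)
            throw new ArgumentException("Expected a literal <!DOCTYPE html> declaration.", nameof(xhtml));

        // Splice the declarations into the document type declaration as an internal subset.
        string patched = xhtml.Substring(0, index)
            + "<!DOCTYPE html [" + EntityDeclarations + "]>"
            + xhtml.Substring(index + doctype.Length);

        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse, XmlResolver = null };
        XDocument document;
        using (var reader = XmlReader.Create(new StringReader(patched), settings))
        {
            document = XDocument.Load(reader);
        }

        // The references are already expanded in the tree, so drop the declarations again.
        if (document.DocumentType != null)
            document.DocumentType.InternalSubset = null;

        return document;
    }
}
```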

My other approach is to preprocess the XHTML string using regex to convert all the named character references into numeric character references (or into the actual Unicode characters), excluding the XML predefined entities &quot; &amp; &apos; &lt; &gt;. However, I'm concerned that there are complexities in the definition of XML that this approach might miss. For example, this answer indicates that characters must not be escaped in comments, CDATA sections, or processing instructions. I assume that my regex would need to be tweaked to exclude all these occurrences.
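One way to sketch this preprocessing step is a single regex whose first alternatives consume comments, CDATA sections, and processing instructions, so that references inside them are never rewritten. The names NamedReferenceConverter and CodePoints are hypothetical and the table holds only two entries. The sketch does not handle the HTML entities that map to two code points, nor references that appear inside a DOCTYPE internal subset.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NamedReferenceConverter
{
    // Illustrative subset; a full table would be generated from the HTML entity list.
    static readonly Dictionary<string, int> CodePoints = new Dictionary<string, int>
    {
        { "nbsp", 160 },
        { "ndash", 8211 }
    };

    // Comments, CDATA sections and processing instructions are matched first, so any
    // "&name;" inside them is consumed as part of that match and left alone.
    static readonly Regex Pattern = new Regex(
        @"<!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?.*?\?>|&(?<name>[A-Za-z][A-Za-z0-9]*);",
        RegexOptions.Singleline);

    public static string Convert(string xhtml)
    {
        return Pattern.Replace(xhtml, match =>
        {
            Group name = match.Groups["name"];
            int codePoint;
            // Protected regions and names missing from the table (including the XML
            // predefined amp, lt, gt, quot, apos) are returned unchanged.
            if (!name.Success || !CodePoints.TryGetValue(name.Value, out codePoint))
                return match.Value;
            return "&#" + codePoint + ";";
        });
    }
}
```

Leaving unknown names untouched means the XML parser still reports them as undeclared entities, which seems preferable to silently dropping them.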

Does anyone have experience or recommendations on the two approaches, or any other approach you'd consider? I would prefer approaches that build on XmlReader's extensibility, but will resort to source string manipulation if there is no other way.

If you apply the identity transform to your source document with the entity map in place, it will substitute the actual characters for you in the result. To me, this is no more work than the regex (one step) and certainly much less complex.

Given this source:

<!DOCTYPE foo [
 <!ENTITY ndash "&#8211;">
 <!ENTITY nbsp "&#160;">
]>
<foo>
  <p>I am &ndash; and I am&nbsp;non-breaking space.</p>
</foo>

And this transform:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

You would have this result as your new input:

<foo>
   <p>I am – and I am non-breaking space.</p>
</foo>
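Since the question targets .NET, the identity transform could be run there with XslCompiledTransform, writing straight into an XDocument. This is a sketch under that assumption; the names IdentityTransformer and Apply are illustrative. The source must still declare its entities (for example in an internal subset), and DTD processing has to be enabled on the input reader.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

static class IdentityTransformer
{
    // The identity stylesheet: copies every node and attribute unchanged.
    const string Stylesheet =
        @"<xsl:stylesheet xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" version=""1.0"">
            <xsl:template match=""@*|node()"">
              <xsl:copy>
                <xsl:apply-templates select=""@*|node()""/>
              </xsl:copy>
            </xsl:template>
          </xsl:stylesheet>";

    public static XDocument Apply(string source)
    {
        var transform = new XslCompiledTransform();
        using (var stylesheetReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(stylesheetReader);
        }

        // The entity declarations live in the DOCTYPE, so DTD processing must be on.
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse };
        var result = new XDocument();
        using (var input = XmlReader.Create(new StringReader(source), settings))
        using (var output = result.CreateWriter())
        {
            transform.Transform(input, output);
        }
        return result;
    }
}
```

The XSLT data model has no node for the document type declaration, which is why the result carries the expanded characters but no DOCTYPE.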

Further, you could just keep all those definitions in a separate file and add one reference to them like this:

<!DOCTYPE foo [
  <!ENTITY % winansi SYSTEM "path/to/my/map/winansi.xml">
  %winansi;
]>
