
Find & replace text not already inside an <A> tag - RegEx .Net

I am working with XML data in .NET from the Federal Register, which contains many references to Executive Orders and chapters of the US Code.

I'd like to hyperlink these references, unless they're already inside an <a> tag (which is determined by the XML, and often links within the document itself).

The pattern I've written matches and deletes the leading and trailing characters, so they are not displayed, even if I include the boundary character in the replacement string:

[?!<a href="#(.*)">]([0-9]{1,2})[ ]{0,1}(U\.S\.C\.|USC)[\s]{0,1}([0-9]{1,5})(\b)[^</a>]

An example of the initial XML:

<p>The Regulatory Flexibility Act of 1980 (RFA), 5 U.S.C. 604(b), as amended, requires Federal agencies to consider the potential impact of regulations on small entities during rulemaking.</p>
<p>Small entities include small businesses, small not-for-profit organizations, and small governmental jurisdictions.</p>
<p>Section 605 of the RFA allows an agency to certify a rule, in lieu of preparing an analysis, if the rulemaking is not expected to have a significant economic impact on a substantial number of small entities. Reference: <a href="#1">13 USC 401</a></p>
  <ul>
      <li><em>Related laws from 14USC301-345 do not apply.</em></li>
      <li><a href="#2">14 USC 301</a> does apply.</li>
  </ul>

As you can see, some references include ranges of US Code sections (e.g. 14 USC 301-345) or references to specific subsections (e.g. 5 USC 604(b)). I only want to link to the first reference in the range, so the link should terminate at the - or the (.

If I'm understanding you correctly, I think the following should work.

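using System.Text.RegularExpressions;

// The negative lookahead (?!</a>) rejects a citation that ends right before a closing </a>.
var re = new Regex(@"\d{1,2}\s?U\.?S\.?C\.?\s?\d{1,5}\b(?!</a>)");
var matches = re.Matches(text); // text holds the XML shown in the question

// matches[0].Value = 5 U.S.C. 604
// matches[1].Value = 14USC301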
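
As for why the pattern in the question drops the neighbouring characters: square brackets form a character class, which consumes exactly one character, so the leading [?!<a href="#(.*)">] and the trailing [^</a>] each match (and therefore replace) one character next to the citation. A lookahead such as (?!</a>) is zero-width and consumes nothing. A minimal sketch of the difference, using a made-up sample string and a shortened pattern (both are assumptions for illustration only):

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Hypothetical sample input, not taken from the Federal Register.
    public const string Sample = "see 5 USC 604(b) now";

    // Character class: [^</a>] consumes one character that is not '<', '/', 'a' or '>'.
    public static string WithCharClass() =>
        Regex.Match(Sample, @"\d{1,2}\s?USC\s?\d{1,5}[^</a>]").Value;

    // Negative lookahead: zero-width, so nothing after the section number is consumed.
    public static string WithLookahead() =>
        Regex.Match(Sample, @"\d{1,2}\s?USC\s?\d{1,5}\b(?!</a>)").Value;

    public static void Main()
    {
        Console.WriteLine($"[{WithCharClass()}]"); // prints "[5 USC 604(]"
        Console.WriteLine($"[{WithLookahead()}]"); // prints "[5 USC 604]"
    }
}
```

The first match includes the "(" that follows the section number, which is the character that went missing after the replacement.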

You might even be able to simplify the regex to \d+\s?U\.?S\.?C\.?\s?\d+\b(?!</a>). I'm not sure whether the upper limits of 2 and 5 are significant.
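
For the replace step itself, something along these lines might work. It is only a sketch: the class name UscLinker, the named groups and the example.com URL scheme are assumptions made for illustration, not part of the question or of any real service. Because the lookahead is zero-width, the character after the section number (the "(" or "-") is left in place.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UscLinker
{
    // The pattern from the answer, with named groups added for the title and first section.
    private static readonly Regex Citation = new Regex(
        @"(?<title>\d{1,2})\s?U\.?S\.?C\.?\s?(?<section>\d{1,5})\b(?!</a>)");

    // Hypothetical URL scheme, used only for illustration.
    private const string UrlFormat = "https://example.com/uscode/{0}/{1}";

    // Wraps every citation that is not directly followed by </a> in a new link.
    public static string Link(string xml) =>
        Citation.Replace(xml, m =>
        {
            string url = string.Format(
                UrlFormat, m.Groups["title"].Value, m.Groups["section"].Value);
            return $"<a href=\"{url}\">{m.Value}</a>";
        });

    public static void Main()
    {
        string input = "<li>Laws from 14USC301-345.</li><li><a href=\"#2\">14 USC 301</a></li>";
        Console.WriteLine(Link(input));
    }
}
```

One caveat: the lookahead only detects a citation that sits immediately before the closing </a>. A citation in the middle of a longer link text would still be wrapped, so this relies on the existing links looking like the ones in the sample XML.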
