
Find & replace text not already inside an <A> tag - RegEx .Net

I am working in .NET with XML data from the Federal Register, which contains many references to Executive Orders and chapters of the US Code.

I'd like to be able to hyperlink to these references, unless they're already inside of an <a> tag (which is determined by the XML, and often links within the document itself).

The pattern I've written also matches one leading and one trailing character around each reference. Those characters are then dropped from the output, even if I include the boundary character in the replacement string:

[?!<a href="#(.*)">]([0-9]{1,2})[ ]{0,1}(U\.S\.C\.|USC)[\s]{0,1}([0-9]{1,5})(\b)[^</a>]

An example of the initial XML:

<p>The Regulatory Flexibility Act of 1980 (RFA), 5 U.S.C. 604(b), as amended, requires Federal agencies to consider the potential impact of regulations on small entities during rulemaking.</p>
<p>Small entities include small businesses, small not-for-profit organizations, and small governmental jurisdictions.</p>
<p>Section 605 of the RFA allows an agency to certify a rule, in lieu of preparing an analysis, if the rulemaking is not expected to have a significant economic impact on a substantial number of small entities. Reference: <a href="#1">13 USC 401</a></p>
  <ul>
      <li><em>Related laws from 14USC301-345 do not apply.</em></li>
      <li><a href="#2">14 USC 301</a> does apply.</li>
  </ul>

As you can see, some references include ranges of US Code sections (e.g. 14 USC 301-345) or point to specific subsections (e.g. 5 USC 604(b)). I only want to link to the first section in a range, so the link should end just before the - or the ( .

If I'm understanding you correctly, I think the following should work.

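using System.Text.RegularExpressions;

// The negative lookahead (?!</a>) rejects a citation that is immediately followed by a closing </a>.
var re = new Regex(@"\d{1,2}\s?U\.?S\.?C\.?\s?\d{1,5}\b(?!</a>)");
var matches = re.Matches(text); // text holds the XML from the question

// matches[0].Value = 5 U.S.C. 604
// matches[1].Value = 14USC301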
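
To turn those matches into links, one option is to pass the same pattern to Regex.Replace with a match evaluator. This is only a sketch: the class name UscLinker, the named groups, and the example.com URL template are placeholders I made up, since the question doesn't say where the links should point.

```csharp
using System.Text.RegularExpressions;

public static class UscLinker
{
    // Same pattern as above, with named groups added for the title and section numbers.
    private static readonly Regex Citation = new Regex(
        @"(?<title>\d{1,2})\s?U\.?S\.?C\.?\s?(?<section>\d{1,5})\b(?!</a>)");

    // Hypothetical URL template (an assumption); swap in the real link target.
    private const string UrlTemplate = "https://example.com/uscode/{0}/{1}";

    public static string Linkify(string xml)
    {
        return Citation.Replace(xml, m =>
        {
            string url = string.Format(
                UrlTemplate, m.Groups["title"].Value, m.Groups["section"].Value);
            // m.Value is the citation exactly as it appeared in the input,
            // so the surrounding characters are left untouched.
            return "<a href=\"" + url + "\">" + m.Value + "</a>";
        });
    }
}
```

Because the lookahead only checks for a </a> directly after the section number, a citation that sits inside an existing link but is followed by more text before the closing tag would still get wrapped. That doesn't happen in the sample XML, but it's worth checking against the real data.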

You might even be able to simplify the pattern to \d+\s?U\.?S\.?C\.?\s?\d+\b(?!</a>) (written with single backslashes inside a verbatim @"..." string). I'm not sure whether the upper limits of 2 and 5 digits are significant.
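
As for why the original pattern was eating the neighbouring characters: as far as I can tell, [?!<a href="#(.*)">] and [^</a>] are character classes, not lookarounds. Each one matches exactly one character and makes it part of the match, so it gets replaced along with the citation. A small sketch that compares the two (the class and method names are just for illustration):

```csharp
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // The pattern from the question, unchanged. The leading [...] and the
    // trailing [^...] each consume one character next to the citation.
    public static string WithOriginal(string input) =>
        Regex.Match(input,
            @"[?!<a href=""#(.*)"">]([0-9]{1,2})[ ]{0,1}(U\.S\.C\.|USC)[\s]{0,1}([0-9]{1,5})(\b)[^</a>]").Value;

    // A negative lookahead only inspects what follows; it consumes nothing.
    public static string WithLookahead(string input) =>
        Regex.Match(input, @"\d{1,2}\s?U\.?S\.?C\.?\s?\d{1,5}\b(?!</a>)").Value;
}
```

For the input "from 14USC301-345", WithOriginal returns " 14USC301-" (including the space before and the hyphen after), while WithLookahead returns "14USC301".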
