Remove parts of Regex.Match string

So I have an HTML table in a string. Most of this HTML came from FrontPage so it is mostly badly formatted. Here's a quick sample of what it looks like.

<b>Table 1</b>
  <table class='class1'>
      <p>Procedure Name</td>
<p><b>Table 2</b></p>
  <table class='class2'>
        <p>Procedure Name</td>
<p> Some text is here</p>

From what I understand, FrontPage automatically adds a <p> in every new cell.

I want to remove those <p> tags that are inside the tables but keep the ones outside the tables. I tried 2 methods so far:

First method

The first method was to use a single regex to capture every <p> tag inside the tables and then call Regex.Replace() to remove them. However, I never managed to get the regex right. (I know parsing HTML with regex is bad. I thought the data was simple enough to apply regex to it.)

I can get everything in each table quite easily using this regex: <table.*?>(.*?)</table>

Then I wanted to only grab the <p> tags so I wrote this: (?<=<table.*?>)(<p>)(?=</table>) . This doesn't match anything. (Apparently .NET allows quantifiers in their lookbehinds. At least that's the impression I had while using http://regexhero.net/tester/ )

Any way I can modify this RegEx to capture only what I need?

Second method

Second method was to capture only the table contents into a string and then String.Replace() to remove the <p> tags. I'm using the following code to capture the matches:

MatchCollection tablematch = Regex.Matches(htmlSource, @"<table.*?>(.*?)</table>", RegexOptions.Singleline);

htmlSource is a string containing the whole HTML page and this variable is what will be sent back to the client after processing. I want to remove only what I need to remove from htmlSource .

How can I use the MatchCollection to remove the <p> tags and then send the updated tables back to htmlSource ?

Thank you

This answer is based on the second approach. I changed the regex so that it matches each whole table, including its opening and closing tags: <table.*?table>

Then I used Regex.Replace with a MatchEvaluator, so that the <p> tags are removed only inside each match:

Regex myRegex = new Regex(@"<table.*?table>", RegexOptions.Singleline);
string replaced = myRegex.Replace(htmlSource, m=> m.Value.Replace("<p>",""));

Output using question input:

<b>Table 1</b>
    <table class='class1'>
        Procedure Name</td>
<p><b>Table 2</b></p>
    <table class='class2'>
        Procedure Name</td>
<p> Some text is here</p>
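
For reference, here is the same approach as a small self-contained program. It is only a sketch under assumptions: the input in Main is made up for illustration (unlike the abbreviated sample in the question, it includes the closing </table> tag that the pattern needs in order to match), and the helper name StripParagraphsInTables is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: removes every literal "<p>" found between
    // "<table" and the next "table>", and leaves the rest of the input alone.
    public static string StripParagraphsInTables(string htmlSource)
    {
        // Singleline lets '.' match newlines, so a table may span several lines.
        Regex tableRegex = new Regex(@"<table.*?table>", RegexOptions.Singleline);

        // The lambda acts as the MatchEvaluator: it receives each matched table
        // and returns the text that replaces it in the result.
        return tableRegex.Replace(htmlSource, m => m.Value.Replace("<p>", ""));
    }

    static void Main()
    {
        // Made-up input, for illustration only.
        string htmlSource =
            "<p>Intro</p>\n" +
            "<table class='class1'>\n" +
            "<tr><td><p>Procedure Name</td></tr>\n" +
            "</table>\n" +
            "<p>Some text is here</p>";

        Console.WriteLine(StripParagraphsInTables(htmlSource));
    }
}
```

Note that m.Value.Replace("<p>", "") removes only the exact text <p>. Opening tags with attributes or different casing, and any closing </p> tags, are left untouched.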

I guess by using a delegate (callback) it could be done.

string html = @"
<b>Table 1</b>
  <table class='class1'>
      <p>Procedure Name</td>
<p><b>Table 2</b></p>
  <table class='class2'>
        <p>Procedure Name</td>
<p> Some text is here</p>
";

// Groups: 1 = opening <table ...> tag, 2 = table contents, 3 = closing </table> tag
Regex RxTable = new Regex( @"(?s)(<table[^>]*>)(.+?)(</table\s*>)" );
Regex RxP = new Regex( @"<p>" );

string htmlNew = RxTable.Replace(
    html,
    delegate(Match match)
    {
       // Strip <p> from the table contents only; keep both table tags unchanged
       return match.Groups[1].Value + RxP.Replace(match.Groups[2].Value, "") + match.Groups[3].Value;
    }
);
Console.WriteLine( htmlNew );

Output:

<b>Table 1</b>
  <table class='class1'>
      Procedure Name</td>
<p><b>Table 2</b></p>
  <table class='class2'>
        Procedure Name</td>
<p> Some text is here</p>

In general, .NET regex lets you work with nested structures by using balancing groups. It is very ugly and you should avoid it, but if you have no other option, you can use it.

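static void Main()
{
    // Hypothetical sample input; the original value is missing from this copy
    string s = "a { b { c } d } e { f }";

    var r = new Regex(@"
                      {                       # opening brace of the outer block
                          (
                              [^{}]           # everything except braces { }   
                            |
                              (?<open>  { )   # if { then push
                            |
                              (?<-open> } )   # if } then pop
                          )*
                          (?(open)(?!))       # true if stack is empty
                      }                       # closing brace of the outer block
                    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    int counter = 0;

    foreach (Match m in r.Matches(s))
    {
        Console.WriteLine("Outer block #{0}\r\n{1}", ++counter, m.Value);
    }
}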

Here the regex "knows" where a block starts and where it ends, so you can use this information to remove a <p> tag if it doesn't have a matching closing one.
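
As a rough illustration of how the balancing-group idea might carry over to the table case, the sketch below matches an outermost <table> together with any tables nested inside it, then strips <p> from the whole match. The pattern, the helper name StripParagraphsInTables, and the input are assumptions made for this example and are not taken from the answers above.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Matches one outermost <table>...</table> block, including nested tables,
    // by pushing on each nested opening tag and popping on each closing tag.
    static readonly Regex OuterTable = new Regex(@"
        <table\b[^>]*>                  # opening tag of the outer table
        (?>
            (?<open> <table\b[^>]*> )   # nested opening tag: push
          |
            (?<-open> </table\s*> )     # nested closing tag: pop
          |
            (?! </?table\b ) .          # any character that does not start a table tag
        )*
        (?(open)(?!))                   # fail if a nested table is still open
        </table\s*>                     # closing tag of the outer table
        ",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: removes the literal "<p>" inside each outermost table.
    public static string StripParagraphsInTables(string html)
    {
        return OuterTable.Replace(html, m => m.Value.Replace("<p>", ""));
    }

    static void Main()
    {
        // Made-up input with one table nested inside another.
        string html =
            "<p>before</p>" +
            "<table><tr><td><p>outer" +
            "<table><tr><td><p>inner</td></tr></table>" +
            "<p>still outer</td></tr></table>" +
            "<p>after</p>";

        Console.WriteLine(StripParagraphsInTables(html));
    }
}
```

With nested tables, a lazy pattern such as <table.*?</table> stops at the first </table>, so the <p> before "still outer" would be left in place. The balanced pattern covers the whole outer table instead. For anything more irregular than this, an HTML parser is probably the safer choice.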
