Remove parts of Regex.Match string

Question

I have HTML containing tables in a string. Most of this HTML came from FrontPage, so it is badly formatted. Here's a quick sample of what it looks like.

<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      <p>Procedure Name</td>
    <td>
        <p>Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        <p>Procedure Name</td>
        <td>
        <p>Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>

From what I understand, FrontPage automatically adds a <p> in every new cell.

I want to remove those <p> tags that are inside the tables but keep the ones outside the tables. I tried 2 methods so far:

First method

The first method was to use a single regex to capture every <p> tag in the tables and then Regex.Replace() to remove them. However, I never managed to get the right regex for this. (I know parsing HTML with regex is bad, but I thought the data was simple enough to apply regex to it.)

I can get everything in each table quite easily using this regex: <table.*?>(.*?)</table>

Then I wanted to grab only the <p> tags, so I wrote this: (?<=<table.*?>)(<p>)(?=</table>). This doesn't match anything. (Apparently .NET allows quantifiers in lookbehinds. At least that's the impression I had while using http://regexhero.net/tester/ )

Is there any way to modify this regex to capture only what I need?

Second method

The second method was to capture only the table contents into a string and then use String.Replace() to remove the <p> tags. I'm using the following code to capture the matches:

MatchCollection tablematch = Regex.Matches(htmlSource, @"<table.*?>(.*?)</table>", RegexOptions.Singleline);

htmlSource is a string containing the whole HTML page, and this variable is what will be sent back to the client after processing. I want to remove only what I need to remove from htmlSource.

How can I use the MatchCollection to remove the <p> tags and then write the updated tables back into htmlSource?

Thank you

Answer 1

This answer is based on the second approach. I changed the regex so that it matches each whole table, tags included:

<table.*?table>

Then I used Regex.Replace with a MatchEvaluator, which is called for each match and returns the replacement text:

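// Match each table, from "<table" to the nearest "table>"; Singleline lets . match newlines.
Regex myRegex = new Regex(@"<table.*?table>", RegexOptions.Singleline);
// For each table, remove every "<p>"; text outside the tables is left untouched.
string replaced = myRegex.Replace(htmlSource, m => m.Value.Replace("<p>", ""));
Console.WriteLine(replaced);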

Output using question input:

<b>Table 1</b>
    <table class='class1'>
    <tr>
    <td>
        Procedure Name</td>
    <td>
        Procedure</td>
    </tr>
    </table>
<p><b>Table 2</b></p>
    <table class='class2'>
    <tr>
        <td>
        Procedure Name</td>
        <td>
        Procedure</td>
    </tr>
    </table>
<p> Some text is here</p>
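
Note that m.Value.Replace("<p>", "") removes only the exact lowercase string <p>. If the FrontPage output also contains variants such as <P> or <p align="left"> (an assumption, the sample in the question does not show any), the callback could use a second regex instead. The following is a minimal sketch, not a tested drop-in; the names TableCleaner and StripParagraphTags are made up for illustration, and it assumes tables are not nested.

```csharp
using System;
using System.Text.RegularExpressions;

static class TableCleaner
{
    // Matches one table, from <table ...> to the nearest </table>.
    // Assumption: tables are not nested.
    static readonly Regex TableRx = new Regex(
        @"<table\b.*?</table\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Matches an opening <p> tag in any case, with or without attributes.
    // The \b keeps it from matching tags such as <pre> or <param>.
    static readonly Regex OpenPRx = new Regex(
        @"<p\b[^>]*>", RegexOptions.IgnoreCase);

    // Removes opening <p> tags inside tables only.
    public static string StripParagraphTags(string html)
    {
        return TableRx.Replace(html, m => OpenPRx.Replace(m.Value, ""));
    }

    static void Main()
    {
        string html =
            "<p>Intro</p><table><tr><td><P align='left'>Name</td></tr></table><p>Outro</p>";
        // Prints: <p>Intro</p><table><tr><td>Name</td></tr></table><p>Outro</p>
        Console.WriteLine(StripParagraphTags(html));
    }
}
```

Closing </p> tags inside tables are left alone here, since the sample input has none.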

Answer 2

I think this can be done by passing a delegate (callback) to Regex.Replace.

string html = @"
<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      <p>Procedure Name</td>
    <td>
        <p>Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        <p>Procedure Name</td>
        <td>
        <p>Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>
";

Regex RxTable = new Regex( @"(?s)(<table[^>]*>)(.+?)(</table\s*>)" );
Regex RxP = new Regex( @"<p>" );

string htmlNew = RxTable.Replace( 
    html,
    delegate(Match match)
    {
       return match.Groups[1].Value + RxP.Replace(match.Groups[2].Value, "") + match.Groups[3].Value;
    }
);
Console.WriteLine( htmlNew );

Output:

<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      Procedure Name</td>
    <td>
        Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        Procedure Name</td>
        <td>
        Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>
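
As an alternative without a callback, the question's first method (one pattern) can also be made to work. The original attempt fails because (?=</table>) only succeeds when </table> follows the <p> immediately, which never happens in the sample. A possible fix is to look ahead for a </table> that is reached without passing another <table, which means the <p> sits inside a table. This is a sketch under the assumption that tables are not nested; the names SinglePattern and RemovePInTables are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglePattern
{
    // A <p> that is followed by </table> with no "<table" in between,
    // i.e. a <p> located inside a (non-nested) table.
    static readonly Regex PInTableRx = new Regex(
        @"<p>(?=(?:(?!<table\b).)*?</table\s*>)",
        RegexOptions.Singleline);

    public static string RemovePInTables(string html)
    {
        return PInTableRx.Replace(html, "");
    }

    static void Main()
    {
        string html =
            "<p>Before</p><table><tr><td><p>Cell</td></tr></table><p>After</p>";
        // Prints: <p>Before</p><table><tr><td>Cell</td></tr></table><p>After</p>
        Console.WriteLine(RemovePInTables(html));
    }
}
```

The lookahead rescans the text after every <p>, so on very large pages the callback version above is likely the cheaper choice.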

Answer 3

In .NET, regex can work with nested structures by using balancing groups. It's very ugly and you should avoid it, but if you have no other option, you can use it.

static void Main()
{
    string s = 
@"A()
{
    for()
    {
    }
    do
    {
    }
}
B()
{
    for()
    {
    }   
}
C()
{
    for()
    {
        for()
        {
        }
    }   
}";

    var r = new Regex(@"  
                      {                       
                          (                 
                              [^{}]           # everything except braces { }   
                              |
                              (?<open>  { )   # if { then push
                              |
                              (?<-open> } )   # if } then pop
                          )+
                          (?(open)(?!))       # fail the match if any { is still unclosed
                      }                                                                  

                    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    int counter = 0;

    foreach (Match m in r.Matches(s))
        Console.WriteLine("Outer block #{0}\r\n{1}", ++counter, m.Value);

    Console.Read();
}

Here the regex "knows" where each block starts and where it ends, so you can use this information to remove a <p> tag if it has no matching closing tag.
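
Applied to the question, the same balancing-group idea can find each outermost table even when tables are nested, which the non-greedy <table.*?</table> patterns cannot do. The following is a rough sketch, assuming well-paired <table> and </table> tags; the names NestedTables, OuterTableRx and StripP are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class NestedTables
{
    // Matches an outermost <table>...</table>, tracking nested tables
    // on the balancing-group stack named "open".
    static readonly Regex OuterTableRx = new Regex(@"
        <table\b[^>]*>
        (?>
            (?<open> <table\b[^>]*> )    # nested opening tag: push
          |
            (?<-open> </table\s*> )      # nested closing tag: pop
          |
            (?! </?table\b ) .           # any other character
        )*
        (?(open)(?!))                    # fail if a nested table is still open
        </table\s*>
        ",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture |
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Removes <p> inside every outermost table, nested tables included.
    public static string StripP(string html)
    {
        return OuterTableRx.Replace(html, m => m.Value.Replace("<p>", ""));
    }

    static void Main()
    {
        string html =
            "<table><tr><td><p>a<table><tr><td><p>b</td></tr></table>" +
            "<p>c</td></tr></table><p>d</p>";
        // Prints: <table><tr><td>a<table><tr><td>b</td></tr></table>c</td></tr></table><p>d</p>
        Console.WriteLine(StripP(html));
    }
}
```

With a non-greedy pattern, the match would end at the inner </table> and the <p> before "c" would be missed.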

3 answers

solution1 1 ACCEPTED 2015-06-08 17:59:24

solution2 1 2015-06-08 18:11:18

solution3 0 2015-06-08 16:00:22
