So I have an HTML table in a string. Most of this HTML came from FrontPage so it is mostly badly formatted. Here's a quick sample of what it looks like.
<b>Table 1</b>
<table class='class1'>
<tr>
<td>
<p>Procedure Name</td>
<td>
<p>Procedure</td>
</tr>
</table>
<p><b>Table 2</b></p>
<table class='class2'>
<tr>
<td>
<p>Procedure Name</td>
<td>
<p>Procedure</td>
</tr>
</table>
<p> Some text is here</p>
From what I understand, FrontPage automatically adds a <p> in every new cell.
I want to remove those <p> tags that are inside the tables but keep the ones outside the tables. I have tried two methods so far:
The first method was to use a single regex to capture every <p> tag in the tables and then call Regex.Replace() to remove them. However, I never managed to get the right regex for this. (I know parsing HTML with regex is bad. I thought the data was simple enough to apply regex to it.)
I can get everything in each table quite easily using this regex: <table.*?>(.*?)</table>
Then I wanted to grab only the <p> tags, so I wrote this: (?<=<table.*?>)(<p>)(?=</table>). This doesn't match anything. (Apparently .NET allows quantifiers in its lookbehinds. At least that's the impression I had while using http://regexhero.net/tester/ )
Is there any way I can modify this regex to capture only what I need?
The second method was to capture only the table contents into a string and then use String.Replace() to remove the <p> tags. I'm using the following code to capture the matches:
MatchCollection tablematch = Regex.Matches(htmlSource, @"<table.*?>(.*?)</table>", RegexOptions.Singleline);
htmlSource is a string containing the whole HTML page, and this variable is what will be sent back to the client after processing. I want to remove only what I need to remove from htmlSource.
How can I use the MatchCollection to remove the <p> tags and then write the updated tables back into htmlSource?
Thank you
This answer is based on the second suggested approach. I changed the regex so that it matches each whole table, including its opening and closing tags:
<table.*?table>
Then I used Regex.Replace with a MatchEvaluator, so the <p> tags are removed only inside each matched table:
Regex myRegex = new Regex(@"<table.*?table>", RegexOptions.Singleline);
string replaced = myRegex.Replace(htmlSource, m=> m.Value.Replace("<p>",""));
Console.WriteLine(replaced);
Output using question input:
<b>Table 1</b>
<table class='class1'>
<tr>
<td>
Procedure Name</td>
<td>
Procedure</td>
</tr>
</table>
<p><b>Table 2</b></p>
<table class='class2'>
<tr>
<td>
Procedure Name</td>
<td>
Procedure</td>
</tr>
</table>
<p> Some text is here</p>
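If the real FrontPage output is less uniform than the sample, the same MatchEvaluator idea can be made a little more tolerant. The following is only a sketch under assumptions that the question does not confirm: that tags may appear in upper case, that <p> may carry attributes, and that stray </p> tags inside tables should go as well. The names TableCleaner and StripParagraphTags are made up for illustration, and tables are assumed not to be nested.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TableCleaner
{
    // Matches one table, from its opening tag to the nearest closing tag.
    // Assumption: tables are not nested.
    private static readonly Regex TableRegex = new Regex(
        @"<table\b.*?</table\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Matches <p>, <P>, <p align="left"> and </p>, but not <pre> or <param>.
    private static readonly Regex ParagraphTagRegex = new Regex(
        @"</?p(\s[^>]*)?>",
        RegexOptions.IgnoreCase);

    // Removes paragraph tags inside tables and leaves the rest untouched.
    public static string StripParagraphTags(string html)
    {
        return TableRegex.Replace(html, m => ParagraphTagRegex.Replace(m.Value, ""));
    }

    public static void Main()
    {
        string html = "<p>Intro</p><table><tr><td><P align=\"left\">Name</td></tr></table><p>Outro</p>";
        // Prints: <p>Intro</p><table><tr><td>Name</td></tr></table><p>Outro</p>
        Console.WriteLine(StripParagraphTags(html));
    }
}
```

Usage would be something like htmlSource = TableCleaner.StripParagraphTags(htmlSource); before sending the page back to the client.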
I guess by using a delegate (callback) it could be done.
string html = @"
<b>Table 1</b>
<table class='class1'>
<tr>
<td>
<p>Procedure Name</td>
<td>
<p>Procedure</td>
</tr>
</table>
<p><b>Table 2</b></p>
<table class='class2'>
<tr>
<td>
<p>Procedure Name</td>
<td>
<p>Procedure</td>
</tr>
</table>
<p> Some text is here</p>
";
Regex RxTable = new Regex( @"(?s)(<table[^>]*>)(.+?)(</table\s*>)" );
Regex RxP = new Regex( @"<p>" );
string htmlNew = RxTable.Replace(
html,
delegate(Match match)
{
return match.Groups[1].Value + RxP.Replace(match.Groups[2].Value, "") + match.Groups[3].Value;
}
);
Console.WriteLine( htmlNew );
Output:
<b>Table 1</b>
<table class='class1'>
<tr>
<td>
Procedure Name</td>
<td>
Procedure</td>
</tr>
</table>
<p><b>Table 2</b></p>
<table class='class2'>
<tr>
<td>
Procedure Name</td>
<td>
Procedure</td>
</tr>
</table>
<p> Some text is here</p>
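As for the single-regex attempt in the question: (?<=<table.*?>)(<p>)(?=</table>) matches nothing because the lookahead requires </table> to come immediately after <p>. One possible variation, shown here only as a sketch, drops the lookbehind and accepts a <p> when a </table> follows before any new <table opens. That test is only valid if tables are never nested, which is an assumption. The names SingleRegexDemo and RemoveParagraphsInTables are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SingleRegexDemo
{
    // <p> is matched only if a </table> follows before any new <table opens,
    // which (for non-nested tables) means the <p> sits inside a table.
    private static readonly Regex ParagraphInTable = new Regex(
        @"<p>(?=(?:(?!<table\b).)*?</table>)",
        RegexOptions.Singleline);

    public static string RemoveParagraphsInTables(string html)
    {
        return ParagraphInTable.Replace(html, "");
    }

    public static void Main()
    {
        string html =
            "<p>Before</p>\n" +
            "<table class='class1'><tr><td>\n<p>Name</td></tr></table>\n" +
            "<p>After</p>";
        // The <p> inside the table is removed; the two outside are kept.
        Console.WriteLine(RemoveParagraphsInTables(html));
    }
}
```

The lookahead rescans the text up to the next table boundary for every <p>, so on large pages the callback version above is probably the cheaper choice.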
In general, .NET regex lets you work with nested structures. It's very ugly and you should avoid it, but if you have no other option, you can use it.
static void Main()
{
    string s =
@"A()
{
for()
{
}
do
{
}
}
B()
{
for()
{
}
}
C()
{
for()
{
for()
{
}
}
}";
    var r = new Regex(@"
        {
            (
                [^{}]           # everything except braces { }
                |
                (?<open> { )    # if { then push
                |
                (?<-open> } )   # if } then pop
            )+
            (?(open)(?!))       # true if stack is empty
        }
        ", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);
    int counter = 0;
    foreach (Match m in r.Matches(s))
        Console.WriteLine("Outer block #{0}\r\n{1}", ++counter, m.Value);
    Console.Read();
}
Here the regex "knows" where a block starts and where it ends, so you can use this information to remove a <p> tag if it doesn't have a matching closing tag.
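To connect this back to the question, the same balancing-group idea can be pointed at <table> and </table> instead of braces, which would matter only if the pages contain nested tables (the sample in the question does not, so this is an assumption). The sketch below matches each outermost table as one unit and then strips <p> inside it. The names NestedTableCleaner and RemoveParagraphsInTables are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedTableCleaner
{
    // Matches an outermost table together with any tables nested inside it.
    private static readonly Regex OuterTable = new Regex(@"
        <table\b[^>]*>                      # outermost opening tag
        (?:
            (?<open> <table\b[^>]*> )       # nested opening tag: push
            |
            (?<-open> </table\s*> )         # nested closing tag: pop
            |
            (?!</?table\b) .                # any character that does not start a table tag
        )*
        (?(open)(?!))                       # fail if a nested table is still open
        </table\s*>                         # outermost closing tag
        ",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveParagraphsInTables(string html)
    {
        return OuterTable.Replace(html, m => m.Value.Replace("<p>", ""));
    }

    public static void Main()
    {
        string html =
            "<p>a</p><table><tr><td><p>x" +
            "<table><tr><td><p>y</td></tr></table>" +
            "<p>z</td></tr></table><p>b</p>";
        // Prints: <p>a</p><table><tr><td>x<table><tr><td>y</td></tr></table>z</td></tr></table><p>b</p>
        Console.WriteLine(RemoveParagraphsInTables(html));
    }
}
```

A lazy pattern such as <table.*?table> would stop at the first </table> in this input and leave the <p> before z in place, which is the case the balancing group is meant to cover.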