Remove parts of Regex.Match string

I have an HTML table in a string. Most of this HTML came from FrontPage, so it is mostly badly formatted. Here's a quick sample of what it looks like.

<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      <p>Procedure Name</td>
    <td>
        <p>Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        <p>Procedure Name</td>
        <td>
        <p>Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>

From what I understand, FrontPage automatically adds a <p> in every new cell.

I want to remove the <p> tags that are inside the tables but keep the ones outside the tables. I have tried two methods so far:

First method

The first method was to use a single regex to capture every <p> tag in the tables and then use Regex.Replace() to remove them. However, I never managed to get the right regex for this. (I know parsing HTML with regex is bad. I thought the data was simple enough to apply regex to it.)

I can get everything in each table quite easily using this regex: <table.*?>(.*?)</table>

Then I wanted to grab only the <p> tags, so I wrote this: (?<=<table.*?>)(<p>)(?=</table>) . This doesn't match anything. (Apparently .NET allows quantifiers in lookbehinds. At least that's the impression I had while using http://regexhero.net/tester/ )

Is there any way I can modify this regex to capture only what I need?

Second method

The second method was to capture only the table contents into a string and then use String.Replace() to remove the <p> tags. I'm using the following code to capture the matches:

MatchCollection tablematch = Regex.Matches(htmlSource, @"<table.*?>(.*?)</table>", RegexOptions.Singleline);

htmlSource is a string containing the whole HTML page, and this variable is what will be sent back to the client after processing. I want to remove only what I need to remove from htmlSource.

How can I use the MatchCollection to remove the <p> tags and then write the updated tables back into htmlSource?

Thank you

This answer is based on the second suggested approach. I changed the regex so that it matches each whole table, tags included:

<table.*?table>

Then I used Regex.Replace with a MatchEvaluator to perform the desired replacement:

Regex myRegex = new Regex(@"<table.*?table>", RegexOptions.Singleline);
string replaced = myRegex.Replace(htmlSource, m=> m.Value.Replace("<p>",""));
Console.WriteLine(replaced);

Output using the question's input:

<b>Table 1</b>
    <table class='class1'>
    <tr>
    <td>
        Procedure Name</td>
    <td>
        Procedure</td>
    </tr>
    </table>
<p><b>Table 2</b></p>
    <table class='class2'>
    <tr>
        <td>
        Procedure Name</td>
        <td>
        Procedure</td>
    </tr>
    </table>
<p> Some text is here</p>
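
The same MatchEvaluator idea can be packaged as a small helper. This is only a sketch: the class and method names are made up for illustration, and it assumes (beyond what the question shows) that the FrontPage markup may also contain uppercase tags or <p> tags with attributes such as align, which a plain String.Replace("<p>", "") would miss. It only removes opening <p> tags, as in the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the original answer.
public static class TableParagraphStripper
{
    // One whole table, from the opening tag to the first closing tag.
    private static readonly Regex Table = new Regex(
        @"<table\b.*?</table\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // An opening <p> tag, with or without attributes.
    // \b keeps tags such as <pre> or <param> from matching.
    private static readonly Regex OpeningP = new Regex(
        @"<p\b[^>]*>",
        RegexOptions.IgnoreCase);

    public static string Strip(string html)
    {
        // The evaluator runs once per table, so text outside tables is untouched.
        return Table.Replace(html, m => OpeningP.Replace(m.Value, ""));
    }
}

public static class StripperDemo
{
    public static void Main()
    {
        string html = "<p>A</p><table><tr><td><P align='left'>B</td></tr></table>";
        Console.WriteLine(TableParagraphStripper.Strip(html));
    }
}
```

Because the table regex is lazy, it stops at the first closing </table>, so this sketch assumes tables are not nested.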

I guess it could be done by using a delegate (callback).

string html = @"
<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      <p>Procedure Name</td>
    <td>
        <p>Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        <p>Procedure Name</td>
        <td>
        <p>Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>
";

Regex RxTable = new Regex( @"(?s)(<table[^>]*>)(.+?)(</table\s*>)" );
Regex RxP = new Regex( @"<p>" );

string htmlNew = RxTable.Replace( 
    html,
    delegate(Match match)
    {
       return match.Groups[1].Value + RxP.Replace(match.Groups[2].Value, "") + match.Groups[3].Value;
    }
);
Console.WriteLine( htmlNew );

Output:

<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      Procedure Name</td>
    <td>
        Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        Procedure Name</td>
        <td>
        Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>
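
For the first method in the question (a single regex), one possible approach is a lookahead instead of a lookbehind: treat a <p> as being inside a table when a closing </table> can be reached before any new opening <table. The sketch below is not from the answers above; the class names are illustrative, and it assumes tables are never nested.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating a single-regex variant.
public static class SingleRegexCleaner
{
    // <p> followed (lazily) by characters that never start a new <table,
    // up to a closing </table>. Assumes tables are not nested.
    private static readonly Regex PInsideTable = new Regex(
        @"<p>(?=(?:(?!<table\b).)*?</table\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveParagraphTags(string html)
    {
        return PInsideTable.Replace(html, "");
    }
}

public static class SingleRegexDemo
{
    public static void Main()
    {
        string html = "<p>Intro</p><table><tr><td><p>Cell</td></tr></table><p>Outro</p>";
        // Prints: <p>Intro</p><table><tr><td>Cell</td></tr></table><p>Outro</p>
        Console.WriteLine(SingleRegexCleaner.RemoveParagraphTags(html));
    }
}
```

The lookahead rescans the text after every <p>, so on large pages the callback approach is likely the cheaper of the two.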

.NET regex lets you work with nested structures (through balancing groups). It's very ugly and you should avoid it, but if you have no other option, you can use it.

static void Main()
{
    string s = 
@"A()
{
    for()
    {
    }
    do
    {
    }
}
B()
{
    for()
    {
    }   
}
C()
{
    for()
    {
        for()
        {
        }
    }   
}";

    var r = new Regex(@"  
                      {                       
                          (                 
                              [^{}]           # everything except braces { }   
                              |
                              (?<open>  { )   # if { then push
                              |
                              (?<-open> } )   # if } then pop
                          )+
                          (?(open)(?!))       # true if stack is empty
                      }                                                                  

                    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    int counter = 0;

    foreach (Match m in r.Matches(s))
        Console.WriteLine("Outer block #{0}\r\n{1}", ++counter, m.Value);

    Console.Read();
}

Here the regex "knows" where a block starts and where it ends, so you can use this information to remove a <p> tag if it doesn't have an appropriate closing one.
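
Applied to the question's HTML, the same balancing-group idea could look roughly like the sketch below, with <table ...> playing the role of { and </table> the role of }. This is an untested-in-production illustration with made-up names; it assumes table tags are properly paired, and it would only matter if tables can be nested, which the sample in the question does not show.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: matches each outermost table, including nested ones.
public static class NestedTableCleaner
{
    private static readonly Regex OuterTable = new Regex(@"
        <table\b[^>]*>                   # opening tag of the outermost table
        (?:
            (?<open> <table\b[^>]*> )    # nested opening tag: push
          |
            (?<-open> </table\s*> )      # nested closing tag: pop
          |
            (?! </?table\b ) .           # any character that does not start a table tag
        )*
        (?(open)(?!))                    # fail if a nested table is still open
        </table\s*>                      # closing tag of the outermost table
        ",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture |
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveParagraphTags(string html)
    {
        // Each match spans a complete outermost table, so every <p> in it is removed.
        return OuterTable.Replace(html, m => m.Value.Replace("<p>", ""));
    }
}

public static class NestedTableDemo
{
    public static void Main()
    {
        string html = "<table><tr><td><p>a<table><tr><td><p>b</td></tr></table>"
                    + "<p>c</td></tr></table><p>d</p>";
        Console.WriteLine(NestedTableCleaner.RemoveParagraphTags(html));
    }
}
```

With a lazy pattern such as <table.*?table>, the match would end at the inner </table>, and the <p> before "c" would be left in place; the balanced version keeps going until the outer table is closed.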
