Regex to Remove HTML Tags

input:

<td>
<span>
<span>spanaaa</span>
<span class="1">spanbbb</span>
<span class="" style="">spanccc</span>
<span style="display:none">spanddd</span>

<div>divaaa</div>
<div class="1">divbbb</div>
<div class="" style="">divccc</div>
<div style="display:none">divddd</div>
</span>
</td>

I need a regular expression or a method that returns the text of the elements that do not have the attribute style="display:none".

output:

spanaaa
spanbbb
spanccc

divaaa
divbbb
divccc

The pattern [.NET flavor]

(?<=<\w+ [^<>]*?\w+=")(?!display:none)(?<mt>[^"<>]+)(?=")

Options: ^ and $ match at line breaks

Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=<\w+ [^<>]*?\w+=")»
   Match the character “<” literally «<»
   Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Match the character “ ” literally « »
   Match a single character NOT present in the list “<>” «[^<>]*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Match the characters “="” literally «="»
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!display:none)»
   Match the characters “display:none” literally «display:none»
Match the regular expression below and capture its match into backreference with name “mt” «(?<mt>[^"<>]+)»
   Match a single character NOT present in the list “"<>” «[^"<>]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=")»
   Match the character “"” literally «"»
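To see what this pattern captures in practice, here is a minimal sketch that runs it through Regex.Matches and collects the named group mt. The class and method names (AttributeValueDemo, CapturedValues) are illustrative and not part of the original answer. As far as I can tell, the group captures non-empty attribute values that do not start with display:none (for the sample input, the class value 1), not the text between the tags, so it would still need further work to produce the requested output.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class AttributeValueDemo
{
    // The .NET pattern from the answer, unchanged (quotes doubled for the verbatim string).
    private const string Pattern =
        @"(?<=<\w+ [^<>]*?\w+="")(?!display:none)(?<mt>[^""<>]+)(?="")";

    // Returns every value captured by the named group "mt".
    public static List<string> CapturedValues(string html)
    {
        var values = new List<string>();
        // RegexOptions.Multiline mirrors "^ and $ match at line breaks".
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.Multiline))
        {
            values.Add(m.Groups["mt"].Value);
        }
        return values;
    }
}
```

Calling AttributeValueDemo.CapturedValues on a string that contains the sample input makes it easy to check which parts of the markup the lookbehind actually selects.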

The pattern [PCRE]

(<\w+ [^<>]*?\w+=")(?!display:none)([^"<>]+)(?=")

Options: ^ and $ match at line breaks

Match the regular expression below and capture its match into backreference number 1 «(<\w+ [^<>]*?\w+=")»
   Match the character “<” literally «<»
   Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Match the character “ ” literally « »
   Match a single character NOT present in the list “<>” «[^<>]*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Match the characters “="” literally «="»
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!display:none)»
   Match the characters “display:none” literally «display:none»
Match the regular expression below and capture its match into backreference number 2 «([^"<>]+)»
   Match a single character NOT present in the list “"<>” «[^"<>]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=")»
   Match the character “"” literally «"»

Regular expressions are not a good choice for this (HTML is too variable), but you can try the following:

<div(?!\s*style="display:none")[^>]*>(.*?)</div>
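A minimal sketch of how this pattern might be applied in C#, using group 1 to collect the text of each visible div. The class name VisibleDivText and the method Extract are assumptions made for illustration. The pattern only recognizes the hidden style when it is the first attribute and written exactly as display:none, and it does not handle nested div elements.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class VisibleDivText
{
    // The pattern from the answer, unchanged (quotes doubled for the verbatim string).
    private const string Pattern =
        @"<div(?!\s*style=""display:none"")[^>]*>(.*?)</div>";

    // Returns the inner text of every div that the pattern accepts.
    public static List<string> Extract(string html)
    {
        var texts = new List<string>();
        // Singleline lets "." cross line breaks inside a div.
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.Singleline))
        {
            texts.Add(m.Groups[1].Value);
        }
        return texts;
    }
}
```

The same idea can be tried for span elements by replacing div with span in the pattern, again only for non-nested elements.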
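
Another approach removes the hidden div elements outright:

input = Regex.Replace(input, @"<div style=""display:none"">.*?</div>", string.Empty, RegexOptions.Singleline);

Here input is the string that contains the HTML. The call deletes every div whose opening tag is exactly <div style="display:none">, together with its content; RegexOptions.Singleline lets . match line breaks, so multi-line elements are covered too.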
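The Regex.Replace call only matches an opening tag written exactly as <div style="display:none">. Below is a sketch of a slightly more general variant, assuming the hidden elements are not nested: it captures the tag name and closes on a backreference, so hidden span elements are removed as well. HiddenElementRemover and its pattern are my own illustration, not something from the original answer.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class HiddenElementRemover
{
    // <(\w+) captures the tag name; </\1> requires the matching closing tag.
    // Assumes the style value is display:none (optional space and semicolon)
    // and that hidden elements do not contain elements of the same name.
    private static readonly Regex Hidden = new Regex(
        @"<(\w+)\b[^>]*\bstyle\s*=\s*""display:\s*none;?""[^>]*>.*?</\1>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the markup with every matched hidden element removed.
    public static string Remove(string html)
    {
        return Hidden.Replace(html, string.Empty);
    }
}
```

For anything beyond simple, flat markup like the sample, an HTML parser is likely a safer choice than extending the pattern further.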

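This is a C# version that strips tags with a single character scan; it is reported to be about 8x faster than regex parsing, and it is easy to port to other languages. Note that it removes every tag and keeps all text, including the text of hidden elements.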

public static string StripTagsCharArray(string source)
{
    // Output buffer: the result can never be longer than the input.
    char[] array = new char[source.Length];
    int arrayIndex = 0;
    bool inside = false; // true while the scan is between '<' and '>'

    for (int i = 0; i < source.Length; i++)
    {
        char let = source[i];
        if (let == '<')
        {
            inside = true;
            continue;
        }
        if (let == '>')
        {
            inside = false;
            continue;
        }
        if (!inside)
        {
            // Copy only the characters that are outside a tag.
            array[arrayIndex] = let;
            arrayIndex++;
        }
    }
    return new string(array, 0, arrayIndex);
}
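To get from the stripped string to the line-by-line output shown in the question, the result still has to be split and cleaned up. The sketch below repeats the same scan with a StringBuilder and then returns the non-empty, trimmed lines. TagStripperDemo, StripTags and TextLines are hypothetical names, and the sketch assumes any hidden elements have already been removed from the input, since the scan itself keeps their text.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of the original answer.
public static class TagStripperDemo
{
    // Same character scan as StripTagsCharArray, written with a StringBuilder.
    public static string StripTags(string source)
    {
        var sb = new StringBuilder(source.Length);
        bool inside = false; // true while between '<' and '>'
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) sb.Append(c);
        }
        return sb.ToString();
    }

    // Strips the tags, then returns each non-empty line with whitespace trimmed.
    public static List<string> TextLines(string html)
    {
        var lines = new List<string>();
        foreach (string raw in StripTags(html).Split('\n'))
        {
            string line = raw.Trim(); // also drops a trailing '\r'
            if (line.Length > 0) lines.Add(line);
        }
        return lines;
    }
}
```

Like the original function, this treats every '<' as the start of a tag, so a literal '<' in text or inside a script block would confuse it.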

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact yoyou2525@163.com.

 