
How to avoid html blocks with regex

I have to find all the strings surrounded by "[" and "]" using regex, but avoiding the ones inside the <table></table> block, for example:

<html>
<body>
<p><table>
   <tbody>
      <tr>
         <td style="border-style: solid; border-width:1px;">
            <span style="font-family: Courier;">[data1]</span>
         </td>
         <td style="border-style: solid; border-width:1px;">
            <span style="font-family: Courier;">[data10]</span>
         </td>
      </tr>
   </tbody>
</table>
</p>
<p>[data3]&nbsp;&nbsp;[data4]&nbsp;&nbsp;[data5]</p>
</body>
</html>

In this case only [data3], [data4] and [data5] should be found. So far I have this: @"(((?<?<span>)(\[[a-zA-Z_0-9]+)](??<\/span>))|((?<.<span>)(\[[a-zA-Z_0-9]+)])|((\[[a-zA-Z_0-9]+)](?!<\/span>)))(?!.*\1)" That finds all the [] blocks that are not surrounded by tags. I tried adding a negative lookahead and lookbehind for <table>, but it doesn't work: it still gets the ones inside the table block.

Hope you guys can help me with this.

The regex below matches a <p> element whose content starts and ends with a [data] item. Because "." does not match line breaks by default, the <p> that wraps the multi-line table is not matched.

/<p.*?>\[(.*?)\]<\/p>/g

Applied to the HTML above, this regex returns <p>[data3]&nbsp;&nbsp;[data4]&nbsp;&nbsp;[data5]</p>.

Once you have that string, run the regex below on it to extract each [data] item.

/\[(.*?)\]/g

Applied to that string, it returns the three matches [data3], [data4] and [data5].
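The two steps can be put together as a small JavaScript sketch. It is only an illustration under a few assumptions: the function name findBracketedItems and the shortened sample HTML are made up for this example, the opening-tag pattern is tightened to <p\b[^>]*> so that tags such as <pre> are not picked up, and the item pattern reuses the [a-zA-Z_0-9]+ character class from the question.

```javascript
// Sample input, shortened from the HTML in the question.
const html = `<html>
<body>
<p><table>
   <tbody>
      <tr>
         <td><span>[data1]</span></td>
         <td><span>[data10]</span></td>
      </tr>
   </tbody>
</table>
</p>
<p>[data3]&nbsp;&nbsp;[data4]&nbsp;&nbsp;[data5]</p>
</body>
</html>`;

// Step 1: match <p> elements whose content starts and ends with a bracketed item.
// "." does not match line breaks (no "s" flag), so the multi-line
// <p><table>...</table></p> block cannot match.
const paragraphPattern = /<p\b[^>]*>\[.*?\]<\/p>/g;

// Step 2: pull each [name] item out of the paragraphs found in step 1.
const itemPattern = /\[[a-zA-Z_0-9]+\]/g;

// Hypothetical helper name, chosen for this example only.
function findBracketedItems(source) {
  const paragraphs = source.match(paragraphPattern) || [];
  return paragraphs.flatMap((p) => p.match(itemPattern) || []);
}

console.log(findBracketedItems(html));
// [ '[data3]', '[data4]', '[data5]' ]
```

Note that this relies on the layout of the sample: a paragraph is only found if it sits on one line and its content begins and ends with a bracketed item. For arbitrary HTML, an HTML parser is generally a safer choice than regular expressions.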
