
Regex with not match at end

I'm trying to write a regex to match patterns like this:

<td style="alskdjf" />

i.e. a self-closing <td>

but not this:

<td style="alsdkjf"><br /></td>

I initially came up with:

<td\s+.*?/>

but that obviously fails on the second example, so I thought something like this might work:

<td\s+.*?[^>]/>

but it doesn't. I'm using C# on .NET.

I'm only looking for <td>s that have an attribute, e.g. <td style="alsdfkj" /> but not <td>.

You're going to have problems using regexes with HTML, since HTML is not a regular language. I'd recommend using an HTML parser for all but the very simplest cases.

This will match what you're looking for, and not match the problematic case you had with your first few tries:

<td[^>]*?/>
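
To see why this works where the earlier attempts did not: . can match >, so .*? is free to run past the end of the opening tag until it reaches the /> of the nested <br />, whereas [^>]*? can never leave the tag it started in. Here is a minimal C# sketch of that difference; the class and method names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TdRegexDemo
{
    // The pattern from the answer: "<td", a run of non-'>' characters, then "/>".
    private static readonly Regex SelfClosingTd = new Regex(@"<td[^>]*?/>");

    // The first attempt from the question, kept for comparison.
    private static readonly Regex FirstAttempt = new Regex(@"<td\s+.*?/>");

    public static bool MatchesAnswer(string html) => SelfClosingTd.IsMatch(html);

    public static bool MatchesFirstAttempt(string html) => FirstAttempt.IsMatch(html);

    public static void Main()
    {
        string selfClosing = "<td style=\"alskdjf\" />";
        string withChild = "<td style=\"alsdkjf\"><br /></td>";

        Console.WriteLine(MatchesAnswer(selfClosing));      // True
        Console.WriteLine(MatchesAnswer(withChild));        // False
        Console.WriteLine(MatchesFirstAttempt(withChild));  // True: .*? ran past the first '>'
    }
}
```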

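Note, however, that if you need to allow > characters inside attribute values, you'd need something like this:

<td(?:"[^"]*"|[^>"])*?/>

This allows > only inside a matching pair of double quotes (you could similarly expand it to allow single quotes). The quoted alternative comes first and " is excluded from the plain character class, so a quote can never be consumed on its own and let the match spill over into a later tag.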
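
A rough C# sketch of the quote-aware idea follows. It lists the quoted alternative first and excludes " from the plain character class, so a quote is only ever consumed as part of a complete quoted value. The class name and the sample inputs are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedAttributeDemo
{
    // Each step consumes either a complete double-quoted value (which may
    // contain '>') or a single character that is neither '>' nor '"'.
    private static readonly Regex SelfClosingTd =
        new Regex("<td(?:\"[^\"]*\"|[^>\"])*?/>");

    public static bool IsSelfClosingTd(string html) => SelfClosingTd.IsMatch(html);

    public static void Main()
    {
        // A '>' inside the attribute value no longer ends the tag early.
        Console.WriteLine(IsSelfClosingTd("<td title=\"a > b\" />"));                   // True

        // A <td> with children is still rejected, even when a later tag has quotes.
        Console.WriteLine(IsSelfClosingTd("<td style=\"a\"><br class=\"x\" /></td>"));  // False
    }
}
```

Unbalanced quotes in the input will still confuse it, which is one more reason to prefer a parser for anything that isn't tightly controlled.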

You can add whatever specific attribute you're looking for into the regex; for instance for your example:

<td[^>]*? style="alskdjf"[^>]*?/>
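
If the attribute name or value comes from a variable, one way to build that pattern in C# is sketched below. ForAttribute is a hypothetical helper, not something from a library; it only wraps the pattern above and passes both parts through Regex.Escape so that characters such as . or ( are matched literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeFilterDemo
{
    // Builds the pattern above for an arbitrary attribute name and value.
    // Regex.Escape keeps regex metacharacters in either part literal.
    public static Regex ForAttribute(string name, string value) =>
        new Regex("<td[^>]*? " + Regex.Escape(name) + "=\"" + Regex.Escape(value) + "\"[^>]*?/>");

    public static void Main()
    {
        Regex styled = ForAttribute("style", "alskdjf");

        Console.WriteLine(styled.IsMatch("<td style=\"alskdjf\" />"));              // True
        Console.WriteLine(styled.IsMatch("<td class=\"x\" style=\"alskdjf\" />"));  // True
        Console.WriteLine(styled.IsMatch("<td style=\"other\" />"));                // False
    }
}
```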

Regex will have serious trouble with messy HTML of the kind browsers routinely have to deal with. Markup can be mangled in all sorts of horrible ways that you just don't want to have to think about!

The HTML Agility Pack is what you really want to be using, and it has had very good reviews everywhere I've seen. It is a robust library for reading any kind of mangled HTML into a DOM. I have personally found it to be a superb library, as surely have others, many of whom use it in business applications.
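
As a rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced: the helper below (EmptyTdsWithAttributes is a made-up name) loads the markup and keeps every <td> that has at least one attribute and no child nodes. Note the assumption baked into that filter: it looks at the parsed tree, so an explicitly closed but empty <td style="x"></td> is returned as well. Telling the two spellings apart is not covered here.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

public static class AgilityPackDemo
{
    // Returns the outer HTML of every <td> that has attributes and no children.
    public static List<string> EmptyTdsWithAttributes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null rather than an empty collection when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td");
        if (cells == null) return result;

        foreach (HtmlNode cell in cells)
        {
            if (cell.HasAttributes && !cell.HasChildNodes)
            {
                result.Add(cell.OuterHtml);
            }
        }
        return result;
    }

    public static void Main()
    {
        string html = "<table><tr><td style=\"alskdjf\" /><td style=\"alsdkjf\"><br /></td></tr></table>";

        foreach (string td in EmptyTdsWithAttributes(html))
        {
            Console.WriteLine(td);
        }
    }
}
```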
