
Finding shortest regex match in HTML

I would like to match the last column in the first row from the following HTML: (this is just an example)

<tr> <td> ABC </td> <td> DEF </td> <td> ABC </td> </tr> 
<tr> <td> GHI </td> <td> JKL </td> <td> GHI </td> </tr>

So what I want to match is: <td> ABC </td> </tr>

I tried toying around with regex101.com but I just can't find a proper way to match the last <td> from the first row only.

What I have so far is the following regex: (<td>).*?(<\/tr>) which matches

<td> ABC </td> <td> DEF >/td> <td> ABC </td> </tr> though.

Is there any way to match only the shortest string between <td> and </tr> ? (I found similar questions but can't figure out a solution to this one.)

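Prepend your pattern with "start of string" ( ^ ) followed by "any run of characters that does not cross a </tr> " ( (?:.(?!<\/tr>))* ). This ensures that no </tr> appears before your match, so the match ends at the first </tr> . Because the prefix is greedy, it also pushes the start of the match to the last <td> before that </tr> . The original pattern should then be captured with a group: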

^(?:.(?!<\/tr>))*((?:<td>).*?(?:<\/tr>))

Demo: https://regex101.com/r/enGASL/1
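
A minimal sketch of how the pattern above could be used from JavaScript, assuming the HTML from the question is held in a string with one row per line. The names `html`, `pattern` and `firstRowLastCell` are made up for this illustration. Note that the wanted text is in capture group 1, not in the full match.

```javascript
// Sample input from the question; the two rows are separated by a newline.
const html = `<tr> <td> ABC </td> <td> DEF </td> <td> ABC </td> </tr>
<tr> <td> GHI </td> <td> JKL </td> <td> GHI </td> </tr>`;

// ^                 start of string (no m flag, so only the very beginning)
// (?:.(?!<\/tr>))*  greedy prefix of characters that are not directly followed by </tr>
// (<td>.*?<\/tr>)   group 1: a <td> and everything up to the nearest </tr>
const pattern = /^(?:.(?!<\/tr>))*(<td>.*?<\/tr>)/;

const m = html.match(pattern);
const firstRowLastCell = m ? m[1] : null;

console.log(firstRowLastCell); // <td> ABC </td> </tr>
```

The full match ( m[0] ) is the whole first row here, which is why the group is needed.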

I would use this to match the required text into a group:

.*(<td>.+<\/td>?.+<\/tr>)

Here is the Regex demo
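
As a rough illustration of this pattern in JavaScript (variable names such as `rows` and `greedy` are assumptions for the example), the greedy `.*` only gives the first row's last cell when each row is on its own line, because `.` does not match a newline without the `s` flag. If the rows are joined onto one line, the same pattern lands on the last row instead:

```javascript
// Rows on separate lines, as in the question.
const rows = `<tr> <td> ABC </td> <td> DEF </td> <td> ABC </td> </tr>
<tr> <td> GHI </td> <td> JKL </td> <td> GHI </td> </tr>`;

// The leading .* swallows as much of the line as it can, then backtracks
// to the last <td> on that line. The wanted text is in group 1.
const greedy = /.*(<td>.+<\/td>?.+<\/tr>)/;

const found = rows.match(greedy);
const lastCell = found ? found[1] : null;
console.log(lastCell); // <td> ABC </td> </tr>

// Same markup on a single line: .* now runs to the end of the second row.
const oneLine = rows.replace("\n", " ");
const foundOneLine = oneLine.match(greedy);
const lastCellOneLine = foundOneLine ? foundOneLine[1] : null;
console.log(lastCellOneLine); // <td> GHI </td> </tr>
```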

Regex 1 is 11 characters,

 <td.{14}tr> 

Regex 2 is 29 characters, but it will cover cell text of any length (word characters only),

 <td>\s*\w*?\s*<\/td>\s*<\/tr> 

but the real problem is that you wanted only one match, whereas this regex, like most others, will match more than once when the string is a multi-row HTML fragment and the global flag is set. The solution is simple:

Leave out the global flag: without it, matching stops as soon as the first match is found.

Demo

/* Regex 1
|| Literal: <td
|| Any 14 characters except line terminators
|| Literal: tr>
|| NO GLOBAL FLAG - once a match is found it stops
*/
const rgx1 = /<td.{14}tr>/;

/* Regex 2
|| Literal: <td>
|| Zero or more whitespace characters
|| Zero or more word characters, collected lazily
|| Zero or more whitespace characters
|| Literal: </td>
|| Zero or more whitespace characters
|| Literal: </tr>
*/
const rgx2 = /<td>\s*\w*?\s*<\/td>\s*<\/tr>/;

const str = `<tr> <td> ABC </td> <td> DEF </td> <td> ABC </td> </tr>
<tr> <td> GHI </td> <td> JKL </td> <td> GHI </td> </tr>`;

let res1 = str.match(rgx1);
let res2 = str.match(rgx2);

console.log('Result 1: ' + res1);
console.log('Result 2: ' + res2);
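
To make the effect of the global flag concrete, here is a small sketch that runs Regex 2 with and without `g` on the same fragment. The variable names ( `fragment`, `once`, `every` ) are only chosen for this example.

```javascript
const fragment = `<tr> <td> ABC </td> <td> DEF </td> <td> ABC </td> </tr>
<tr> <td> GHI </td> <td> JKL </td> <td> GHI </td> </tr>`;

const once = /<td>\s*\w*?\s*<\/td>\s*<\/tr>/;   // no g flag: stops at the first match
const every = /<td>\s*\w*?\s*<\/td>\s*<\/tr>/g; // g flag: collects every match

const firstOnly = fragment.match(once);  // match array for the first hit only
const allRows = fragment.match(every);   // plain array with one entry per row

console.log(firstOnly[0]);   // <td> ABC </td> </tr>
console.log(allRows.length); // 2
console.log(allRows[1]);     // <td> GHI </td> </tr>
```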

BTW, there's a typo in the string: DEF >/td> and JKL >/td>

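// \w+ must be followed only by tag characters and spaces up to "r>",
// so it picks the cell text that sits right before the first </tr>
console.log(
  `<tr> <td> ABC </td> <td> DEF </td> <td> XCC </td> </tr>
<tr> <td> GHI </td> <td> JKL </td> <td> GHI </td> </tr>`
    .match(/\w+(?=[</> td]+r>)/)
)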

Be as precise as possible when writing your regexps.

(<td>)[^\<\>]*(<\/td>)\s*(<\/tr>)

This assumes that the contents of the td tag do not contain HTML markup.
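
A short sketch of that pattern in JavaScript, with the redundant escapes inside the character class dropped. The names `table`, `precise` and `nested` are invented for the example. The second check shows the stated assumption in action: a cell that contains markup is not matched.

```javascript
const table = `<tr> <td> ABC </td> <td> DEF </td> <td> ABC </td> </tr>
<tr> <td> GHI </td> <td> JKL </td> <td> GHI </td> </tr>`;

// [^<>]* cannot run across another tag, so the match has to be a single
// <td>...</td> that is followed (after optional whitespace) by </tr>.
const precise = /(<td>)[^<>]*(<\/td>)\s*(<\/tr>)/;

const hit = table.match(precise);
console.log(hit[0]); // <td> ABC </td> </tr>

// A cell with nested markup breaks the assumption and does not match.
const nested = "<tr> <td> <b>ABC</b> </td> </tr>";
const nestedMatches = precise.test(nested);
console.log(nestedMatches); // false
```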
