
Java regular expression pattern for capturing nested tables

I have an HTML page with many nested tables that I need to parse.

<table>   <table>   <table > Status </table>  </table>  </table>
<table>   <table>   </table>  </table>

I am trying to create a Java regex Pattern that matches only the text:

 <table> Status </table>

I also tried an HTML parser (Jsoup) but could not find a clean way to parse this. I have been racking my brain over this and still cannot extract the text cleanly. Any help using a Java regex Pattern or Jsoup is appreciated.

<table\s*>\s*(([^<]|<[^t]|<t[^a]|<ta[^b]|<tab[^l]|<tabl[^e])*?)\s*</table\s*>

Read the first captured group (whatever the outermost ( and ) in the regex match) to get the content between <table> and </table>; in your first example that is Status.

Explanation:

We search for a string that begins with:

<table\s*>\s* (\s* matches any amount of whitespace)

contains anything but the sequence <table:

([^<]|<[^t]|<t[^a]|<ta[^b]|<tab[^l]|<tabl[^e])*

and finishes with:

\s*</table\s*> (\s* matches any amount of whitespace)

The ? after the * makes the quantifier reluctant, so the group takes the shortest possible sequence between <table> and </table> and never runs past the first </table>.
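A minimal sketch of how this pattern might be applied from Java, assuming the page is already available as a String. The class and method names (InnermostTableRegex, innermostTexts) are made up for illustration; only java.util.regex is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class InnermostTableRegex {
    // The regex from the answer, with backslashes doubled for a Java string literal.
    // Group 1 is the trimmed content of a table that contains no nested "<table".
    private static final Pattern INNERMOST = Pattern.compile(
        "<table\\s*>\\s*(([^<]|<[^t]|<t[^a]|<ta[^b]|<tab[^l]|<tabl[^e])*?)\\s*</table\\s*>");

    // Collects group 1 of every match, in document order.
    static List<String> innermostTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = INNERMOST.matcher(html);
        while (m.find()) {
            texts.add(m.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String html =
            "<table>   <table>   <table > Status </table>  </table>  </table>\n"
          + "<table>   <table>   </table>  </table>";
        for (String text : innermostTexts(html)) {
            System.out.println("[" + text + "]");
        }
    }
}
```

For the sample in the question this should yield Status for the first line and an empty string for the empty innermost table on the second line, so callers may want to skip empty results.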

Here is a regex that works:

.*(\s*<table\s*>\s*)+(<table\s*>.*</table\s*>)(\s*</table\s*>\s*)+

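The inner table with its text is in the 2nd matching group. Because the .* inside that group is greedy, the group can also swallow a following </table>; writing it as .*? keeps the group to the inner table alone.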

As a fiddle: http://fiddle.re/w73vc6

This of course only works if the nesting is as you indicated, i.e. nothing else inside the outer tables and no further tables inside the table that holds the text you need.
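A rough sketch of reading the 2nd group from Java. It deviates from the regex as posted in one place: the .* inside the 2nd group is written as the reluctant .*? so that the group stops at the first </table> instead of backtracking from the end of the line. The names NestedTableGroups and innerTable are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class NestedTableGroups {
    // Group 1: one or more opening outer tables.
    // Group 2: the inner table (reluctant ".*?" is a deviation from the posted regex).
    // Group 3: one or more closing outer tables.
    private static final Pattern NESTED = Pattern.compile(
        ".*(\\s*<table\\s*>\\s*)+(<table\\s*>.*?</table\\s*>)(\\s*</table\\s*>\\s*)+");

    // Returns the inner table markup, or null when the input is not nested this way.
    static String innerTable(String html) {
        Matcher m = NESTED.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String line = "<table>   <table>   <table > Status </table>  </table>  </table>";
        System.out.println(innerTable(line));
    }
}
```

Since . does not match line terminators by default, this sketch expects each nested structure to sit on a single line, as in the question.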
