
Java regular expression pattern for capturing nested tables

I have an HTML page with many nested tables that I need to parse.

<table>   <table>   <table > Status </table>  </table>  </table>
<table>   <table>   </table>  </table>

I am trying to create a Java regex Pattern that matches only the text:

 <table> Status </table>

I also tried an HTML parser, Jsoup, but could not find a clean way to parse this. I have been racking my brain over it but could not extract this text cleanly. Any help with a Java regex Pattern or Jsoup is appreciated.

<table\s*>\s*(([^<]|<[^t]|<t[^a]|<ta[^b]|<tab[^l]|<tabl[^e])*?)\s*</table\s*>

Take the first captured group (whatever the outer ( and ) in the regex matched) to get the content between <table> and </table> (Status in your first example).
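As a minimal sketch of how the pattern above could be used from Java: the class name InnermostTableText and the helper innermostTexts are made up for illustration, and skipping empty captures is an assumption on my part (the second sample line contains an innermost table with no text, which otherwise yields an empty group 1).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class InnermostTableText {
    // The pattern from the answer; group 1 holds the text of a table
    // that has no further <table inside it.
    static final Pattern INNERMOST = Pattern.compile(
        "<table\\s*>\\s*(([^<]|<[^t]|<t[^a]|<ta[^b]|<tab[^l]|<tabl[^e])*?)\\s*</table\\s*>");

    static List<String> innermostTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = INNERMOST.matcher(html);
        while (m.find()) {
            String text = m.group(1);
            if (!text.isEmpty()) { // assumption: empty innermost tables are not wanted
                texts.add(text);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String html =
            "<table>   <table>   <table > Status </table>  </table>  </table>\n"
          + "<table>   <table>   </table>  </table>";
        System.out.println(innermostTexts(html)); // prints [Status]
    }
}
```

Note that the pattern only accepts a bare <table> tag, so a table with attributes (for example a border attribute) would need the opening part of the regex adjusted.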

Explanation:

We search for a string that begins with:

<table\s*>\s* (\s* is for any number of blank spaces)

Contains anything but the sequence <table:

([^<]|<[^t]|<t[^a]|<ta[^b]|<tab[^l]|<tabl[^e])*

And finishes with:

\s*</table\s*> (\s* is for any number of blank spaces)

The ? after * makes the quantifier lazy, so we get the smallest possible match between <table> and </table> and do not match anything after the first </table>.
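To see what the middle part does on its own, here is a small sketch that tests it against a few strings. The class name NoNestedTableCheck and the method hasNoTableOpen are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class NoNestedTableCheck {
    // The middle part of the pattern on its own: any run of characters
    // that never spells out "<table".
    static final Pattern NO_TABLE_OPEN =
        Pattern.compile("([^<]|<[^t]|<t[^a]|<ta[^b]|<tab[^l]|<tabl[^e])*");

    static boolean hasNoTableOpen(String s) {
        return NO_TABLE_OPEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasNoTableOpen(" Status "));            // true: plain text
        System.out.println(hasNoTableOpen("<tr><td>x</td></tr>")); // true: other tags are allowed
        System.out.println(hasNoTableOpen("</table>"));            // true: closing tags are allowed
        System.out.println(hasNoTableOpen("a <table> b"));         // false: contains <table
    }
}
```

Since the alternation also accepts </table>, it is the lazy quantifier, not the alternation, that stops the match at the first closing tag. The check is case-sensitive, so <TABLE would not be excluded.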

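Here is a regex that works:

.*(\s*<table\s*>\s*)+(<table\s*>.*?</table\s*>)(\s*</table\s*>\s*)+

The inner table with its text is in the 2nd matching group. The .*? inside that group is lazy so that the group ends at the first </table> rather than swallowing one of the outer closing tags.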

As a fiddle: http://fiddle.re/w73vc6

This of course only works if the nesting is as you indicated, i.e. there is nothing else inside the outer tables and no further tables inside the table that holds the text you need.
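A minimal sketch of reading the 2nd group from Java follows. It writes the pattern with a lazy .*? inside the second group so that the group stops at the first </table>. The names InnerTableGroup and innerTable are hypothetical, and returning null when nothing matches is just one possible convention.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class InnerTableGroup {
    static final Pattern NESTED = Pattern.compile(
        ".*(\\s*<table\\s*>\\s*)+(<table\\s*>.*?</table\\s*>)(\\s*</table\\s*>\\s*)+");

    // Returns the markup of the inner table, or null if the input
    // does not have the expected nesting.
    static String innerTable(String line) {
        Matcher m = NESTED.matcher(line);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(innerTable(
            "<table>   <table>   <table > Status </table>  </table>  </table>"));
        // prints <table > Status </table>
    }
}
```

Because . does not match line breaks by default, this is best applied one line at a time unless the pattern is compiled with Pattern.DOTALL.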
