
Extract between HTML tags with an unknown tag name?

<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>....

I want to extract everything that comes between <b>Topic1</b> and the next <b> starting tag. In this case that would be: <ul>asdasd</ul><br/>

Problem: it does not necessarily have to be the <b> tag; it could be any other repeating tag.

So my question is: how can I extract that text dynamically? The only static things are:

  • The signal keyword to look for is always "Topic1". I'd like to use the tags surrounding it as the tag to look for.
  • The tag is always repeated. In this example it is <b>, but it might just as well be <i>, <strong>, <h1>, etc.

I know how to write the Java code, but what would the regex look like?

String regex = ">Topic1<";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
    for (int i = 1; i <= m.groupCount(); i++) {
        System.out.println(m.group(i));
    }
}

The following should work (shown as it appears inside a Java string literal, hence the doubled backslash):

Topic1</(.+?)>(.*?)<\\1>

Input: <b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>

Output: <ul>asdasd</ul><br/>

Code:

    Pattern p = Pattern.compile("Topic1</(.+?)>(.*?)<\\1>");
    //  get a matcher object
    Matcher m = p.matcher("<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>");
    while(m.find()) {
        System.out.println(m.group(2));  // <ul>asdasd</ul><br/>
    }
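
For reference, a self-contained version of the snippet above might look like the sketch below. The class name `TopicExtractor` and the method `extractAfter` are hypothetical names introduced here for illustration. The pattern is the one from this answer, with two additions that are assumptions on my part: the keyword is passed through `Pattern.quote` so that special characters in it are matched literally, and `Pattern.DOTALL` is set in case the content spans several lines.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopicExtractor {

    // Hypothetical helper: returns the text between "<tag>keyword</tag>"
    // and the next opening "<tag>", or an empty Optional if nothing matches.
    public static Optional<String> extractAfter(String html, String keyword) {
        // Group 1 captures the tag name from the closing tag after the keyword.
        // Group 2 lazily captures everything up to the next "<tagname>".
        Pattern p = Pattern.compile(
                Pattern.quote(keyword) + "</(.+?)>(.*?)<\\1>",
                Pattern.DOTALL); // assumption: content may contain line breaks
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String bold = "<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>";
        System.out.println(extractAfter(bold, "Topic1").orElse("(no match)"));

        // Same call with a different repeating tag.
        String heading = "<h1>Topic1</h1><p>text</p><h1>Topic2</h1>";
        System.out.println(extractAfter(heading, "Topic1").orElse("(no match)"));
    }
}
```

Note that the back-reference only matches a bare opening tag such as `<b>`; if the repeated tag carries attributes (for example `<b class="x">`), the pattern would need to be adapted.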

Try this:

String pattern = "\\<.*?\\>Topic1\\<.*?\\>"; // this will see the tag no matter what tag it is
String text = "<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b>"; // your string to be split
String[] attributes = text.split(pattern);
for(String atr : attributes) 
{
    System.out.println(atr);
}

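This will print an empty first line (the match sits at the very start of the string, so split returns a leading empty string), followed by: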

<ul>asdasd</ul><br/><b>Topic2</b>
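
One thing to watch with the split approach: when Topic1 is not the first element, the `.*?` inside the angle brackets can run across earlier tags and swallow them into the delimiter. A possible variant, assuming tags never contain a `>` inside an attribute value, uses `[^>]*` instead. The class name `SplitOnTopic` and the method `splitAround` are hypothetical names used only for this sketch.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitOnTopic {

    // Hypothetical helper: splits the text around "<anytag>keyword</anytag>".
    public static String[] splitAround(String text, String keyword) {
        // [^>]* keeps each tag match inside a single pair of angle brackets.
        String pattern = "<[^>]*>" + Pattern.quote(keyword) + "<[^>]*>";
        return text.split(pattern);
    }

    public static void main(String[] args) {
        String text = "<p>intro</p><b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b>";
        // Prints the part before the keyword tag and the part after it.
        System.out.println(Arrays.toString(splitAround(text, "Topic1")));
    }
}
```

Like the original, this returns everything after the keyword up to the end of the input; it does not stop at the next repeated tag.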
