Regex: find defined string between html tags in Groovy or Java

From Jenkins I'm using the Confluence API to get the content of a page as HTML, such as this:

<tr>
    <td>bla1a</td>
    <td>bla2a</td>
    <td>bla3a</td>
</tr>
<tr>
    <td>bla1b</td>
    <td>what I’m searching</td>
    <td>bla3b</td>
</tr>
<tr>
    <td>bla1c</td>
    <td>bla2c</td>
    <td>bla3c</td>
</tr>

What I want is to update the content of a particular row of a table where I only know the value of one string, in this case “what I’m searching”. So what I need is a regex that matches a whole table row containing the searched string:

<tr> … what I’m searching …</tr>

and returns the entire row as follows:

<tr>
    <td>bla1b</td>
    <td>what I’m searching</td>
    <td>bla3b</td>
</tr>

Don't use regex to extract data from HTML or to manipulate it. Mandatory links: "You can't parse [X]HTML with regex" and "Why it's not possible to use regex to parse HTML/XML: a formal explanation". Use a proper parser instead, for example Jsoup. Jsoup provides a very convenient API for extracting and manipulating HTML data and is intuitive to work with. Its selector syntax is described in the Jsoup documentation (the "Selector syntax" cookbook page and the Selector API reference). Using Jsoup, your code could look like:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Example {

    public static void main(String[] args) {
        String html =
                "<html>\n"
                + "<head></head>"
                + "<body>"
                + "    <table>"
                + "        <tr>\n"
                + "            <td>bla1a</td>\n"
                + "            <td>bla2a</td>\n"
                + "            <td>bla3a</td>\n"
                + "        </tr>\n"
                + "        <tr>\n"
                + "             <td>bla1b</td>\n"
                + "             <td>what I’m searching</td>\n"
                + "             <td>bla3b</td>\n"
                + "        </tr>\n"
                + "        <tr>\n"
                + "             <td>bla1c</td>\n"
                + "             <td>bla2c</td>\n"
                + "             <td>bla3c</td>\n"
                + "        </tr>"
                + "    </table>"
                + "</body>\n"
                + "</html>";

        Document doc = Jsoup.parse(html);

        Element result = doc.selectFirst("tr:contains(what I’m searching)");
        System.out.println(result);
    }
}

output:

<tr> 
 <td>bla1b</td> 
 <td>what I’m searching</td> 
 <td>bla3b</td> 
</tr>

You can also easily manipulate your HTML:

Element td = result.selectFirst("td:contains(what I’m searching)");
td.text("My updated data");
System.out.println(result);

output:

<tr> 
 <td>bla1b</td> 
 <td>My updated data</td> 
 <td>bla3b</td> 
</tr>

Maven dependency:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.2</version>
</dependency>

For a rather simple lookup like yours, you don't really need an external library; a simple regex will do perfectly well.

It also avoids building a full document tree, so it is likely to be faster and less resource-hungry.

I'd put it like so (Groovy):

String txt = '''\
<tr>
  <td>bla1a</td>
  <td>bla2a</td>
  <td>bla3a</td>
</tr>
<tr>
  <td>bla1b</td>
  <td>what I’m searching</td>
  <td>bla3b</td>
</tr>
<tr>
  <td>bla1c</td>
  <td>bla2c</td>
  <td>bla3c</td>
</tr>'''

List res = ( txt =~ /(?s)<tr>(\s*<td>[\w\s]+<\/td>\s*)*<td>what I’m searching<\/td>(\s*<td>[\w\s]+<\/td>\s*)*<\/tr>/ ).findAll()*.first()

assert res == ['''<tr>
  <td>bla1b</td>
  <td>what I’m searching</td>
  <td>bla3b</td>
</tr>''']
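
Since the question asks for Groovy or Java, here is a rough Java sketch of the same idea using only java.util.regex. The class name RowFinder and the helpers rowPattern, findRows and updateCell are made up for this illustration and are not part of any library. The pattern is slightly restructured compared to the Groovy one: whitespace is consumed only after each cell, and the search text is passed through Pattern.quote so that any regex metacharacters in it are matched literally. It makes the same assumptions as the Groovy version: plain <tr> and <td> tags without attributes, and neighbouring cells containing only word characters and whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowFinder {

    // Hypothetical helper: builds a pattern for one <tr>...</tr> row that
    // contains a cell whose text is exactly `needle`.
    static Pattern rowPattern(String needle) {
        // One ordinary cell followed by optional whitespace.
        String cell = "(?:<td>[\\w\\s]+</td>\\s*)*";
        return Pattern.compile(
                "<tr>\\s*" + cell
                + "<td>" + Pattern.quote(needle) + "</td>\\s*"
                + cell + "</tr>");
    }

    // Returns every row (as matched text) that contains the cell.
    static List<String> findRows(String html, String needle) {
        List<String> rows = new ArrayList<>();
        Matcher m = rowPattern(needle).matcher(html);
        while (m.find()) {
            rows.add(m.group());
        }
        return rows;
    }

    // Replaces the cell text in the first matching row and returns the new
    // HTML. newText is inserted as-is (no HTML escaping); if no row matches,
    // the input is returned unchanged.
    static String updateCell(String html, String oldText, String newText) {
        Matcher m = rowPattern(oldText).matcher(html);
        if (!m.find()) {
            return html;
        }
        String updatedRow = m.group().replace(
                "<td>" + oldText + "</td>", "<td>" + newText + "</td>");
        return html.substring(0, m.start()) + updatedRow + html.substring(m.end());
    }

    public static void main(String[] args) {
        // \u2019 is the typographic apostrophe used in the page content.
        String needle = "what I\u2019m searching";
        String html =
                "<tr>\n  <td>bla1a</td>\n  <td>bla2a</td>\n  <td>bla3a</td>\n</tr>\n"
                + "<tr>\n  <td>bla1b</td>\n  <td>" + needle + "</td>\n  <td>bla3b</td>\n</tr>\n"
                + "<tr>\n  <td>bla1c</td>\n  <td>bla2c</td>\n  <td>bla3c</td>\n</tr>";

        System.out.println(findRows(html, needle));
        System.out.println(updateCell(html, needle, "My updated data"));
    }
}
```

As with the Groovy version, this stops matching as soon as the markup gains attributes, nested tags or punctuation in the neighbouring cells, so it is only reasonable while the Confluence table stays this simple.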

