Java regex to replace only text in html file

Question

I have to write some Java code that highlights text in an HTML file displayed in a JTextPane.

For highlighting, I replace "match" with "<span style=\"background-color: #FFFF00\">match</span>" and set the whole replaced text in the JTextPane. This works fine. I do it with java.util.regex.Pattern and java.util.regex.Matcher.

Now I have found a problem: the matcher also matches text inside an HTML tag. For example, take this line:

<pre><a name="hello-world">Hello World</a></pre>

I need a regex for a java.util.regex.Pattern that only searches in the string "Hello World", i.e. in the text between the tags.

So, if I want to highlight the matches of "e", the result should look like this:

<pre><a name="hello-world">H<span style="background-color: #FFFF00">e</span>llo World</a></pre>

Thank you very much for your help!!

Answer 1

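I would do something like this, capturing the text between a ">" and the next "<" (matchedTextBuilder is assumed to hold the HTML):

Pattern pattern = Pattern.compile(">([^<]*)<");
Matcher matcher = pattern.matcher(matchedTextBuilder.toString());
while (matcher.find()) {
    // group(1) is the text between the tags, without the surrounding ">" and "<"
    String matchedFoundText = matcher.group(1);
}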

A better approach, which copies the tags unchanged and replaces only inside the text between them:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HighlightDemo {
    public static void main(String[] args) {
        String originalString = "dfedf >Hello< href= ui /> Hello< another";
        StringBuilder sb = new StringBuilder();
        // Group 1 captures the text between ">" and the next "<".
        // [^<]* also accepts text with spaces, such as "Hello World".
        Pattern pattern = Pattern.compile(">([^<]*)<");
        Matcher matcher = pattern.matcher(originalString);
        int endIndex = 0;
        while (matcher.find()) {
            // Copy everything up to and including the ">" unchanged.
            sb.append(originalString, endIndex, matcher.start() + 1);
            // Highlight only inside the captured text.
            sb.append(matcher.group(1).replace("e",
                    "<span style=\"background-color: #FFFF00\">e</span>"));
            sb.append("<");
            endIndex = matcher.end();
        }
        // Append the remainder; endIndex already points past the last "<".
        sb.append(originalString.substring(endIndex));
        System.out.println(sb);
    }
}
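The same idea can also be sketched without manual index bookkeeping, by splitting the input into tags and text runs with a single alternation and highlighting only the text runs. This is a minimal sketch, not a full HTML parser: the class name TagAwareHighlighter and the method highlight are made up for illustration, and it assumes that attribute values contain no ">" and that the document has no comments or script blocks with markup-like content.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class TagAwareHighlighter {

    private static final String OPEN = "<span style=\"background-color: #FFFF00\">";
    private static final String CLOSE = "</span>";

    // A token is a complete tag, a stray "<", or a run of text without "<".
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|<|[^<]+");

    static String highlight(String html, String term) {
        // Quote the term so regex metacharacters in it are matched literally.
        Pattern needle = Pattern.compile(Pattern.quote(term));
        // "$0" re-inserts the matched text between the opening and closing span.
        String replacement = Matcher.quoteReplacement(OPEN) + "$0"
                + Matcher.quoteReplacement(CLOSE);

        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String token = m.group();
            if (token.charAt(0) == '<') {
                out.append(token); // tag or stray "<": copy unchanged
            } else {
                out.append(needle.matcher(token).replaceAll(replacement));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "<pre><a name=\"hello-world\">Hello World</a></pre>";
        System.out.println(highlight(line, "e"));
    }
}
```

For the line from the question, only the "e" in "Hello World" is wrapped; the "e" characters in "pre" and in name="hello-world" are left alone because they sit inside tag tokens.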

Answer 2

Try Jsoup, an HTML parser that can scrape and parse HTML from a URL, file, or string, and can also manipulate HTML elements, attributes, and text. An example for your case:

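import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Entities;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

public class NewClass2 {

    public static void main(String[] args) {
        String html = "<!DOCTYPE html>\n" +
                        "<html>\n" +
                            "<head>\n" +
                                "<title>Page Title</title>\n" +
                            "</head>\n" +
                            "<body>\n" +
                                "<h1>This is a Heading which should match</h1>\n" +
                                "<p>This is a paragraph which should also match.</p>\n" +
                            "</body>\n" +
                        "</html>";

        String matchWord = "match";
        String highlighted = "<span style=\"background-color: #FFFF00\">" + matchWord + "</span>";
        Document doc = Jsoup.parse(html);
        System.out.println("before : \n");
        System.out.println(doc + "\n");

        // Elements whose own text contains the word (this search is case-insensitive).
        Elements matchingElements = doc.getElementsContainingOwnText(matchWord);
        for (Element e : matchingElements) {
            // Work on the text nodes only, so tags and attributes of child elements stay untouched.
            for (TextNode t : e.textNodes()) {
                String text = t.getWholeText();
                if (text.contains(matchWord)) {
                    // Escape the text first, because before(String) parses its argument as HTML.
                    t.before(Entities.escape(text).replace(matchWord, highlighted));
                    t.remove();
                }
            }
        }
        System.out.println("after : \n");
        System.out.println(doc);
    }
}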


2 answers

solution1
0 2016-12-12 14:23:59

solution2
0 2016-12-12 17:37:34
