How to parse HTML text and links with Java and jsoup

I need to parse text from a webpage. The text is presented in this way:

nonClickableText= link1 link2  nonClickableText2= link1 link2

I want to convert all of this into a single string in Java. The non-clickable text should stay as it is, while each piece of clickable text should be replaced with the URL it links to.

So in Java I would end up with this:

String parsedHTML = "nonClickableText= example.com example.com nonClickableText2= example3.com example4.com";

Here are some pictures: first second

What exactly are link1 and link2? According to your example

"... nonClickableText2= example3.com example4.com"

they can be different, so what would be the source besides the href?

Based on your images, the following code should give you everything you need to adapt it to your final string representation. First we grab the <strong> block, then go through its child nodes and pick up each <a> child together with the text node that precedes it:

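import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

String htmlString = "<html><div><p><strong>\"notClickable1\"<a rel=\"nofollow\" target=\"_blank\" href=\"example1.com\">clickable</a>\"notClickable2\"<a rel=\"nofollow\" target=\"_blank\" href=\"example2.com\">clickable</a>\"notClickable3\"<a rel=\"nofollow\" target=\"_blank\" href=\"example3.com\">clickable</a></strong></p></div></html>";

Document doc = Jsoup.parse(htmlString); // can be replaced with Jsoup.connect("yourUrl").get();
StringBuilder sb = new StringBuilder();

Element container = doc.select("div>p>strong").first();

for (Node node : container.childNodes()) {
    Node prev = node.previousSibling(); // null for the first child, so check the type before use
    // Only handle <a> elements that directly follow a text node (the label)
    if (node.nodeName().equals("a") && prev instanceof TextNode) {
        sb.append(((TextNode) prev).text().replace("\"", "")); // label without the quotes
        sb.append("= ").append(node.attr("href")).append(" "); // link target
    }
}
// Strings are immutable: trim() returns a new string, so the result must be kept
String parsedHTML = sb.toString().trim();

System.out.println(parsedHTML);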

Output:

notClickable1= example1.com notClickable2= example2.com notClickable3= example3.com 
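
The loop above emits one link per label. The question shows several links after each label, so here is a minimal sketch of a variant that keeps every text node as it is and replaces every <a> with its href. The class name LinkFlattener, the method flatten and the sample markup are hypothetical and chosen only for illustration; the real page may be structured differently.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class LinkFlattener {

    // Copies text nodes unchanged and replaces each direct <a> child with its href.
    static String flatten(Element container) {
        StringBuilder sb = new StringBuilder();
        for (Node node : container.childNodes()) {
            if (node instanceof TextNode) {
                sb.append(((TextNode) node).getWholeText());
            } else if (node instanceof Element && "a".equals(node.nodeName())) {
                sb.append(node.attr("href"));
            }
            // Any other child element is skipped in this sketch.
        }
        // Collapse whitespace runs left over from the markup and trim the ends.
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical markup (an assumption, not taken from the original page).
        String html = "<p><strong>label1= <a href=\"example1.com\">link1</a> "
                + "<a href=\"example2.com\">link2</a> label2= "
                + "<a href=\"example3.com\">link1</a> "
                + "<a href=\"example4.com\">link2</a></strong></p>";
        Document doc = Jsoup.parse(html);
        Element container = doc.select("p > strong").first();
        System.out.println(flatten(container));
    }
}
```

With the sample markup this should print label1= example1.com example2.com label2= example3.com example4.com. Only direct children of the container are considered, so links nested deeper inside other elements would need a recursive traversal instead.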
