Use JSoup to get all textual links

Question

I'm using JSoup to grab content from web pages.

I want to get all the links on a page that contain some text. It doesn't matter what the text is; the link just needs to be non-empty and must not consist only of an image.

Example of links I want:

<a href="somepage.html">Link to Some Page</a>

Since it contains the text "Link to Some Page"

Links I don't want:

<a href="somepage.html"><img src="someimage.jpg"/></a>
<a href="somepage.html"></a>

My code looks like this. How can I modify it to only get the first type of link?

Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}

Answer 1

You could do something like this. It does its job, though it's probably not the fanciest solution out there.

Note: text() returns only the element's text content with the HTML markup stripped, so a link that contains nothing but an <img> tag gives you an empty string.

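Document document = // get the document
Elements linksOnPage = document.select("a");

for (Element pageElem : linksOnPage) {
    // text() drops markup, so empty and image-only links yield ""
    if (pageElem.text().trim().isEmpty()) {
        continue;
    }
    String link = pageElem.attr("abs:href");
    // do something with the link
}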
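
For reference, here is a minimal self-contained sketch of the same text() check, run against the three example links from the question. The class name TextualLinks, the method textualLinks, and the base URI http://example.com/ are placeholders chosen for illustration; it assumes jsoup is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextualLinks {

    // Returns the absolute URLs of all links whose text content is non-empty.
    static List<String> textualLinks(String html, String baseUri) {
        // The base URI is needed so that "abs:href" can resolve relative links.
        Document document = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        for (Element anchor : document.select("a[href]")) {
            // text() returns text content only, so empty and image-only
            // anchors produce an empty string and are skipped.
            if (anchor.text().trim().isEmpty()) {
                continue;
            }
            links.add(anchor.attr("abs:href"));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"one.html\">Link to Some Page</a>"
                + "<a href=\"two.html\"><img src=\"someimage.jpg\"/></a>"
                + "<a href=\"three.html\"></a>";
        // http://example.com/ is a placeholder base URI.
        System.out.println(textualLinks(html, "http://example.com/"));
    }
}
```

Only the first link should survive the filter. Note that an image's alt attribute is not part of text(), so an image-only link is skipped even when the image has alt text.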

Answer 2

I am using this and it's working fine:

Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))"); 
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
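
As far as I can tell, this works because jsoup's :matches(regex) pseudo-selector keeps elements whose text matches the given regular expression, and [^\s]+ needs at least one non-whitespace character. The sketch below checks that behaviour of the pattern with plain java.util.regex; it assumes the regex is applied as a search within the text rather than as a full match, and the class name NonBlankRegex and the method hasText are made up for the example.

```java
import java.util.regex.Pattern;

public class NonBlankRegex {

    // Same pattern as in the selector, without the extra capture group.
    static final Pattern NON_BLANK = Pattern.compile("[^\\s]+");

    // True when the text contains at least one non-whitespace character.
    static boolean hasText(String text) {
        return NON_BLANK.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasText("Link to Some Page")); // true
        System.out.println(hasText(""));                  // false
        System.out.println(hasText("   "));               // false
    }
}
```

An empty link and an image-only link both have empty text, so neither can match the pattern.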

2 answers

solution1
3 2017-12-04 13:39:58

solution2
0 2017-12-04 07:39:47
