
Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.

I want to get all the links on a page that contain some text. It doesn't matter what the text is; it just needs to be non-empty, and a link that only wraps an image doesn't count.

Example of links I want:

<a href="somepage.html">Link to Some Page</a>

This one qualifies because it contains the text "Link to Some Page".

Links I don't want:

<a href="somepage.html"><img src="someimage.jpg"/></a>
<a href="somepage.html"></a>

My code looks like this. How can I modify it to only get the first type of link?

Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}

You could do something like this. It does its job, though it's probably not the fanciest solution out there.

Note: text() returns clean text, so any HTML markup nested inside the link is not included in the result.

Document doc = // get the doc
Elements linksOnPage = doc.select("a");

for (Element pageElem : linksOnPage) {
    // text() is empty for <a></a> and for links that only wrap an image
    if (pageElem.text().trim().isEmpty())
        continue;
    String link = pageElem.attr("abs:href");
    // do something with the link
}
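As a self-contained illustration of the same idea, here is a minimal sketch that runs the filter over the three sample links from the question. It assumes jsoup is on the classpath; the class name TextualLinks, the helper textualLinks, and the base URI http://example.com/ are made up for the example. It uses Element.hasText(), which should behave like the trimmed text() check for these inputs.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TextualLinks {

    // Hypothetical helper: absolute URLs of all links that contain non-blank text.
    static List<String> textualLinks(String html, String baseUri) {
        // The base URI lets jsoup resolve relative hrefs when "abs:href" is requested.
        Document document = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        for (Element anchor : document.select("a[href]")) {
            // hasText() is false for empty anchors and for anchors that only wrap an image.
            if (anchor.hasText()) {
                links.add(anchor.attr("abs:href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // The three kinds of link from the question.
        String html = "<a href=\"somepage.html\">Link to Some Page</a>"
                + "<a href=\"image.html\"><img src=\"someimage.jpg\"/></a>"
                + "<a href=\"empty.html\"></a>";
        System.out.println(textualLinks(html, "http://example.com/"));
    }
}
```

Only the first link should survive the filter, since the other two have no text of their own.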

I am using this and it's working fine:

Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))"); 
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
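The selector works because :matches(regex) keeps elements whose text contains a match for the regular expression, and [^\\s]+ needs at least one non-whitespace character. The sketch below shows that check in isolation using only java.util.regex. It assumes jsoup searches the element's text for the pattern (find-style rather than a full match); the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class NonBlankTextCheck {

    // The same pattern that the selector passes to :matches(...).
    private static final Pattern NON_WHITESPACE = Pattern.compile("([^\\s]+)");

    // Hypothetical helper: true when the text has at least one non-whitespace character.
    static boolean hasVisibleText(String text) {
        // find() looks for a match anywhere in the text (assumed to mirror :matches).
        return NON_WHITESPACE.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasVisibleText("Link to Some Page")); // text link
        System.out.println(hasVisibleText(""));                  // empty or image-only link
        System.out.println(hasVisibleText("   "));               // whitespace-only link
    }
}
```

Note that the selector in the answer matches any a element, so adding [href] (as in a[href]:matches(...)) is worth considering if anchors without an href should be skipped.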

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint this content, please cite the site URL or the original address. For any questions, please contact yoyou2525@163.com.

 