I have a lot of lines like these in a webpage:
<a href="City1/Waves321.aspx"><span><span style="font-family: Courier New">Title</span></span></a>
<span style="font-family: Courier New"> (<a href="City1/River267.aspx">txt</a>)</span></li></ul>
<a href="City2/Waves761.aspx"><span><span style="font-family: Courier New">Title</span></span></a>
<span style="font-family: Courier New"> (<a href="City2/River767.aspx">txt</a>)</span></li></ul>
and I want to get only:
City1/Waves321.aspx
City2/Waves761.aspx
and so on: every href that comes before "Title".
I tested with this code:
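import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        String address;
        // Fetch the page with a 10-second timeout
        Document doc = Jsoup.connect(url).timeout(10*1000).get();
        // Select anchors whose href matches the regex "(Waves)"
        Elements links = doc.select("a[href~=(Waves)]");
        //String linkText = links.text();
        for (Element link : links) {
            String linkHref = link.attr("href");
            address = url + linkHref;
            System.out.println(address);
        }
    }
}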
It works for most of the links, but it misses the ones where "Title" starts on a new line, like this:
<a href="City/Waves321.aspx"><span><span style="font-family: Courier New">
Title</span></span></a><span style="font-family: Courier New"> (<a href="City/River267.aspx">txt</a>)</span></li></ul>
I cannot change the webpage's code, by the way.
How can I achieve this in Jsoup?
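You can do it like this. Note that this prints the href of every <a> element on the page, so the "River" links are printed too and you still need to filter for the ones you want: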
Elements e = doc.getElementsByTag("a");
e.stream().forEach(p -> System.out.println(p.attr("href")));
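To keep only the links that wrap "Title", one possible approach is to select the anchors whose href contains "Waves" and then compare the normalized link text. The sketch below is an illustration under assumptions, not a tested fix for the real page: the class name TitleLinks, the helper titleLinks, the inline HTML sample and the base URI http://example.com/ are all made up for the example. It relies on Element.text() collapsing whitespace, so a line break before "Title" should not matter, and on absUrl("href") to resolve relative links against the base URI instead of concatenating strings.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TitleLinks {

    // Hypothetical helper: returns the absolute URLs of links whose href
    // contains "Waves" and whose visible text is "Title" once whitespace
    // has been normalized.
    static List<String> titleLinks(String html, String baseUri) {
        // The base URI lets jsoup resolve relative hrefs later on
        Document doc = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("a[href*=Waves]")) {
            // text() collapses runs of whitespace, including line breaks
            if (link.text().trim().equals("Title")) {
                result.add(link.absUrl("href"));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Inline sample modeled on the markup in the question; the second
        // link has "Title" on a new line
        String html =
              "<ul><li><a href=\"City1/Waves321.aspx\"><span><span>Title</span></span></a>"
            + "<span> (<a href=\"City1/River267.aspx\">txt</a>)</span></li></ul>"
            + "<ul><li><a href=\"City/Waves321.aspx\"><span><span>\nTitle</span></span></a>"
            + "<span> (<a href=\"City/River267.aspx\">txt</a>)</span></li></ul>";

        // http://example.com/ is a placeholder base URI
        for (String url : titleLinks(html, "http://example.com/")) {
            System.out.println(url);
        }
    }
}
```

For a live page, Jsoup.connect(url).get() already sets the base URI to the fetched URL, so absUrl("href") can be called on the selected links directly.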