I want to extract the URL inside the src attribute of an <img src="..."> tag on a certain website. How can I do that using Jsoup in Java? So far, I've only tried reading the whole tag and printing the output to the console, but nothing comes up. I'd love to know how to access attributes of tags in general, since I'll need to repeat this process for various tags. In my test code below, I'm reading some Strings from a table using the raritySelector, and the output is as expected. However, when I try reading the img tag from the website using the iconSelector, nothing is printed to the console. Do I need to specify something else in order to read the <img>'s attributes/details, or am I doing something wrong?
String url = "https://dbz.space/cards/";
Document page = Jsoup.connect(url).get();
ArrayList<String> cardRarity = new ArrayList<>();
ArrayList<String> iconUrls = new ArrayList<>();
for (int i = 1; i < 6; i++) {
    String iconSelector = "body > div.view > section.list.gi > div:nth-child(1) > div.content > img";
    String raritySelector = "body > div.view > section.list.gi > div:nth-child(" + i + ") > a > table > tbody > tr:nth-child(2) > td.rarity > i";
    Elements rarities = page.select(raritySelector);
    Elements icons = page.select(iconSelector);
    for (Element e : rarities) {
        cardRarity.add(e.text());
    }
    for (Element e : icons) {
        iconUrls.add(e.text());
    }
}
for (String s : cardRarity) {
    System.out.println(s);
}
for (String s : iconUrls) {
    System.out.println(s);
}
PS: I've never used Jsoup or done any website scraping before. After a bit of research, I came across various posts where people suggested using Regex or the String API, but none of them could agree on which is the right way to go. Please point me in the right direction if possible.
Your "problem" is that jsoup is an HTML parser and works with the plain HTML response returned by the website. It does not handle the page like a "normal" browser, so JavaScript, for example, is not executed.
The initial response of the linked page does not contain any elements matching this selector:
"body > div.view > section.list.gi > div:nth-child(1) > div.content > img"
Instead, there is some initial markup, which JavaScript then changes in your browser to build up the full page.
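One way to confirm this from code is to count how many elements a selector matches in the HTML that jsoup actually parsed. The sketch below is only an illustration: the class name StaticMarkupCheck, the helper countMatches, and the shortened HTML string are assumptions made for the example, not part of the site or of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticMarkupCheck {

    // Hypothetical helper: counts the elements matching a CSS selector
    // in the HTML exactly as jsoup parses it (no JavaScript is run).
    static int countMatches(String html, String selector) {
        Document doc = Jsoup.parse(html);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        // Shortened stand-in for a server response in which div.content
        // is still empty because scripts have not filled it in.
        String html = "<section class=\"list gi\">"
                + "<div class=\"item card\"><div class=\"content\"></div></div>"
                + "</section>";

        // No <img> exists in the static markup, so this prints 0.
        System.out.println(countMatches(html, "div.content > img"));

        // The card container itself is present, so this prints 1.
        System.out.println(countMatches(html, "section.list.gi > div.item.card"));
    }
}
```

Running the same kind of check against the Document returned by Jsoup.connect(url).get() should show whether a selector copied from the browser's developer tools matches anything in the raw response.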
The initial markup looks like this (you can see it by viewing the page source, e.g. in Chrome via view-source:https://dbz.space/cards/):
<section class="list gi">
<div class="item card cb45 eb24 rb5 d0" res="1018030" base="1018031" aim="" quantity="" release="" imgur="MsVAmR3" ele="4" type="2">
<div class="content"></div>
<a class="ab" href="/cards/1018031-androids-17-18android-16-the-androids-journey" title="The Androids' Journey - Androids #17 & #18/Android #16" hash="7b0463b1a48488b0e3670cc3ae46731f">
<table>
<tr>
<td class="dokkan"></td>
<td class="element"></td>
</tr>
<tr>
<td class="rarity">
<i>lr</i>
</td>
<td class="lock off">
<i class="material-icons off"></i>
<i class="material-icons on"></i>
</td>
</tr>
</table>
</a>
<div class="dv">19836</div>
</div>
<div class="item card cb25 eb12 rb5 d0" res="1012900" base="1012901" aim="" quantity="" release="" imgur="vId5fzO" ele="2" type="1">
<div class="content"></div>
<a class="ab" href="/cards/1012901-super-saiyan-goku-super-saiyan-vegeta-fused-super-power" title="Fused Super Power - Super Saiyan Goku & Super Saiyan Vegeta" hash="9fb89cd0e5449af5bae38a8602879494">
...
</div>
</section>
So if you adapt your selector accordingly:
"body > div.view > section.list.gi > div.item.card"
you can read, for example, the imgur filename or other information from the attributes of each matched element e:
e.attr("imgur")
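As a minimal sketch of reading attributes in general: Element.attr(name) returns the value of an attribute, whereas Element.text() returns only the text inside the element, which is empty for an <img>. The class name AttrSketch, the helper attrValues, and the sample HTML (including the /img/example.png path) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AttrSketch {

    // Hypothetical helper: collects the value of one attribute from
    // every element that matches the selector and has that attribute.
    static List<String> attrValues(String html, String selector, String attrName) {
        Document doc = Jsoup.parse(html);
        List<String> values = new ArrayList<>();
        for (Element e : doc.select(selector)) {
            if (e.hasAttr(attrName)) {
                values.add(e.attr(attrName));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Shortened sample modeled on the initial markup, plus an
        // invented <img> tag to show the src case.
        String html = "<section class=\"list gi\">"
                + "<div class=\"item card\" imgur=\"MsVAmR3\"><div class=\"content\"></div></div>"
                + "<div class=\"item card\" imgur=\"vId5fzO\"><div class=\"content\"></div></div>"
                + "</section>"
                + "<img src=\"/img/example.png\">";

        // Prints [MsVAmR3, vId5fzO]
        System.out.println(attrValues(html, "section.list.gi > div.item.card", "imgur"));

        // Prints [/img/example.png]
        System.out.println(attrValues(html, "img", "src"));
    }
}
```

For URL attributes such as src or href, jsoup also provides e.absUrl("src"), which resolves a relative value against the document's base URI; that base URI is set when the page is fetched with Jsoup.connect(url).get().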
As an alternative, you could use PhantomJS / GhostDriver (both are easy to find with a web search), which act like a browser engine: use them to fetch and render the page first, and then parse the resulting HTML with, e.g., jsoup.