
Extracting the src value from an <img> tag on a website using Java and Jsoup

I want to extract the URL inside the src attribute of an <img src="..."> tag on a certain website. How can I do that using Jsoup in Java? So far, I've only tried reading the whole tag and printing the output to the console, but nothing comes up. I'd also like to know how to access attributes of tags in general, since I'll need to repeat this process for various tags. In my test code below, I read some Strings from a table using the raritySelector, and the output is as expected. However, when I try to read the img tag using the iconSelector, nothing is printed to the console. Do I need to specify something else to read the <img> tag's attributes, or am I doing something wrong?

        String url = "https://dbz.space/cards/";
        Document page = Jsoup.connect(url).get();
        ArrayList<String> cardRarity = new ArrayList<>();
        ArrayList<String> iconUrls = new ArrayList<>();

        for(int i=1; i < 6; i++) {

            String iconSelector = "body > div.view > section.list.gi > div:nth-child(1) > div.content > img";
            String raritySelector = "body > div.view > section.list.gi > div:nth-child(" + i + ") > a > table > tbody > tr:nth-child(2) > td.rarity > i";

            Elements rarities = page.select(raritySelector);
            Elements icons = page.select(iconSelector);

            for(Element e : rarities) {
                cardRarity.add(e.text());
            }

            for(Element e : icons) {
                iconUrls.add(e.text());
            }
        }


        for(String s : cardRarity) {
            System.out.println(s);
        }
        for(String s : iconUrls) {
            System.out.println(s);
        }

PS: I've never used Jsoup or done any website scraping before. After a bit of research, I came across various posts suggesting either Regex or the String API, but none of them agreed on which is the right way to go. Please point me in the right direction on this if possible.

Your "problem" is that jsoup is an HTML parser: it works only with the plain HTML response returned by the website.

It does not process the page like a "normal" browser does, and therefore JavaScript, for example, is not executed.

The linked page's initial response does not contain any element matching this selector:

"body > div.view > section.list.gi > div:nth-child(1) > div.content > img"

Instead, the response contains only some initial markup, which JavaScript running in your browser then modifies to build up the full page.

The initial markup looks like this (you can see it by viewing the page source, e.g. in Chrome via view-source:https://dbz.space/cards/):

<section class="list gi">
    <div class="item card cb45 eb24 rb5 d0" res="1018030" base="1018031" aim="" quantity="" release="" imgur="MsVAmR3" ele="4" type="2">
        <div class="content"></div>
        <a class="ab" href="/cards/1018031-androids-17-18android-16-the-androids-journey" title="The Androids' Journey - Androids #17 & #18/Android #16" hash="7b0463b1a48488b0e3670cc3ae46731f">
            <table>
                <tr>
                    <td class="dokkan"></td>
                    <td class="element"></td>
                </tr>
                <tr>
                    <td class="rarity">
                        <i>lr</i>
                    </td>
                    <td class="lock off">
                        <i class="material-icons off">&#xE898;</i>
                        <i class="material-icons on">&#xE897;</i>
                    </td>
                </tr>
            </table>
        </a>
        <div class="dv">19836</div>
    </div>
    <div class="item card cb25 eb12 rb5 d0" res="1012900" base="1012901" aim="" quantity="" release="" imgur="vId5fzO" ele="2" type="1">
        <div class="content"></div>
        <a class="ab" href="/cards/1012901-super-saiyan-goku-super-saiyan-vegeta-fused-super-power" title="Fused Super Power - Super Saiyan Goku & Super Saiyan Vegeta" hash="9fb89cd0e5449af5bae38a8602879494">
        ...
    </div>
</section>
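
To check this without hitting the network, the markup above can be fed straight into jsoup. The following is a minimal sketch, assuming an abbreviated copy of one card from the markup above; the class name StaticMarkupCheck and the helper countMatches are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class StaticMarkupCheck {
    // Abbreviated copy of the initial markup shown above (one card only).
    static final String SAMPLE_HTML =
          "<section class=\"list gi\">"
        + "<div class=\"item card\" imgur=\"MsVAmR3\">"
        + "<div class=\"content\"></div>"
        + "<a class=\"ab\"><table>"
        + "<tr><td class=\"dokkan\"></td><td class=\"element\"></td></tr>"
        + "<tr><td class=\"rarity\"><i>lr</i></td></tr>"
        + "</table></a>"
        + "</div></section>";

    // Hypothetical helper: parses the HTML and counts the elements matching a CSS query.
    static int countMatches(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // The <img> is only added later by JavaScript, so jsoup finds nothing here.
        System.out.println(countMatches(SAMPLE_HTML, "div.content > img")); // prints 0
        // The rarity cell is part of the static markup, so it is found.
        System.out.println(countMatches(SAMPLE_HTML, "td.rarity > i"));     // prints 1
    }
}
```

This matches what the question observed: the rarity selector returns text, while the icon selector yields an empty Elements, so the loop over icons never runs.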

So if you adapt your selector to the markup that is actually present:

"body > div.view > section.list.gi > div.item.card";

you can read, for example, the imgur file name or any other attribute from each matched element e:

e.attr("imgur")
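
As a rough sketch of attribute access in general: Element.attr(name) returns the raw attribute value, and Element.absUrl(name) resolves a relative URL against the document's base URI. Note that text() on an <img> returns an empty string, because an image has no text content. The sample HTML, the <img> snippet, the example.com base URI, and the names CardAttributes, attrValues and absoluteSrcs below are assumptions made up for illustration, not taken from the site.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class CardAttributes {
    // Abbreviated copy of two cards from the initial markup.
    static final String SAMPLE_HTML =
          "<section class=\"list gi\">"
        + "<div class=\"item card\" res=\"1018030\" imgur=\"MsVAmR3\"><div class=\"content\"></div></div>"
        + "<div class=\"item card\" res=\"1012900\" imgur=\"vId5fzO\"><div class=\"content\"></div></div>"
        + "</section>";

    // Collects the value of one attribute from every element matching the query.
    static List<String> attrValues(String html, String cssQuery, String attrName) {
        Document doc = Jsoup.parse(html);
        List<String> values = new ArrayList<>();
        for (Element e : doc.select(cssQuery)) {
            if (e.hasAttr(attrName)) {
                values.add(e.attr(attrName));
            }
        }
        return values;
    }

    // Collects absolute image URLs; baseUri is needed to resolve relative src values.
    static List<String> absoluteSrcs(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            urls.add(img.absUrl("src"));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(attrValues(SAMPLE_HTML, "section.list.gi > div.item.card", "imgur"));
        // prints [MsVAmR3, vId5fzO]

        // Hypothetical markup containing an <img>, to show src extraction.
        String withImg = "<div class=\"content\"><img src=\"/img/card.png\"></div>";
        System.out.println(absoluteSrcs(withImg, "https://example.com/cards/"));
        // prints [https://example.com/img/card.png]
    }
}
```

When the document comes from Jsoup.connect(url).get(), the base URI is set from the request URL, so absUrl("src") works without passing one explicitly.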

As an alternative, you could use PhantomJS / GhostDriver, which acts like a browser engine, to fetch and render the page first, and then parse the resulting HTML with e.g. jsoup.


 