
Web Scraping with Java/Jsoup

I am trying to extract the average salary from Glassdoor. This is the HTML element that contains it:

<span class="OccMedianBasePayStyle__payNumber" data-test="AveragePay">$118,034</span>

Here is what I have so far. This code prints the whole element I want, but I don't know how to pull out just the salary text from the span with data-test="AveragePay".

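import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Trans {

    public static void main(String[] args) {
        String url = "https://www.glassdoor.com/Salaries/seattle-software-engineer-salary-SRCH_IL.0,7_IM781_KO8,25.htm";
        Document document = null;
        try {
            document = Jsoup.connect(url).get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the page could not be fetched
        }

        // Select every span on the page
        Elements links = document.select("span");

        for (Element link : links) {
            // Prints the matching element's full HTML, not just its text
            System.out.println("Text: " + link.getElementsByAttributeValueContaining("data-test", "Average"));

            //System.out.println("Text: " + link.text());
        }
    }
}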

You are not using the correct selector. You should include the data-test="AveragePay" attribute in the span selector.

Change your selector and for loop as follows. The selector span[data-test='AveragePay'] matches only span elements whose data-test attribute equals AveragePay, so link.text() returns just the salary:

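import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Trans {

    public static void main(String[] args) {
        String url = "https://www.glassdoor.com/Salaries/seattle-software-engineer-salary-SRCH_IL.0,7_IM781_KO8,25.htm";
        Document document;
        try {
            document = Jsoup.connect(url).get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // stop here: there is no document to query
        }

        // Match only spans whose data-test attribute is exactly "AveragePay"
        Elements links = document.select("span[data-test='AveragePay']");

        for (Element link : links) {
            // text() returns the element's text content, e.g. $118,034
            System.out.println("Text: " + link.text());
        }
    }
}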
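If you need the salary as a number rather than as display text, the string returned by link.text() can be cleaned up with the standard library alone. The following is only a minimal sketch: the class name SalaryText and the method parseSalary are hypothetical, and it assumes the text looks like the $118,034 shown in the question (a currency symbol, comma thousands separators, and a dot as the decimal separator, if any).

```java
import java.math.BigDecimal;

class SalaryText {

    // Hypothetical helper: turns display text such as "$118,034" into a number.
    // Assumes US-style formatting; throws NumberFormatException if no digits remain.
    static BigDecimal parseSalary(String text) {
        // Drop everything except digits and the decimal point ("$", ",", spaces)
        String digits = text.replaceAll("[^0-9.]", "");
        return new BigDecimal(digits);
    }

    public static void main(String[] args) {
        // Sample value taken from the HTML in the question
        System.out.println(parseSalary("$118,034")); // prints 118034
    }
}
```

In the loop above you could call parseSalary(link.text()) instead of printing the raw text. BigDecimal is used here so that an amount with cents, if one ever appears, is not silently rounded.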

Note: I hope this is for educational purposes only. Web scraping can be subject to legal restrictions, so check the target site's terms and conditions before using this for any commercial purpose.


 