简体   繁体   中英

On Android Java, how to match a String within a Div Tag and extract a value?

Here's my code;

 private static String fetch(String url) throws MalformedURLException, IOException, UnsupportedEncodingException {
    String userAgent1 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/56.0.3051.43";
    try {
        Document doc1 = Jsoup.connect(url).userAgent(userAgent1).get();
        Elements divTags = doc1.getElementsByTag("div");
        String re = "^<div class=\\\"Ta\\(c\\) Py\\(6px\\) Bxz\\(bb\\) BdB Bdc\\(\\$seperatorColor\\) Miw\\(120px\\) Miw\\(100px\\)\\-\\-pnclg D\\(tbc\\)\\\" data-test=\\\"fin-col\\\"><span>.*</span></div>$";
        
        for (Element div : divTags) {
            Pattern pattern = Pattern.compile(re, Pattern.DOTALL);
            Matcher matcher = pattern.matcher(div.html());

            if (matcher.find()) {
                String data = matcher.group(1);
                Log.d("Matched: ", data);
            }
            else {
                Log.d("Nothing Matched: ", "");
            }
        }
    } catch (Exception e) {
        Log.e("err-new", "err", e);
    }
    return "";
}

This function takes a URL as input, in our case: https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2 and extracts all the div tags using JSOUP .

And then, I need to extract these values using Pattern matching. But, in my code above, all I get is that "Nothing matched: " .

Here's the web page from which I am interested in getting the four numeric values corresponding to the first four yearly columns, corresponding to the row named EBIT . (Stands for Earnings Before Interest and Taxes)

Link : https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2

Input : Looking to get values 122,034,000, 111,852,000, 69,964,000, 69,313,000 on the EBIT row for columns 9/30/2022, 9/30/2021, 9/30/2020, 9/30/2019.

On Inspect , these values are under the following <div> tags.

EBIT 1: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>122,034,000</span></div>

EBIT 2: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>111,852,000</span></div>

EBIT 3: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>69,964,000</span></div>

EBIT 4: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>69,313,000</span></div>

And the same thing for the 4 columns under the Quarterly tab on the same web page. Looking to get values 25,484,000, 23,785,000, 30,830,000, 41,935,000 on the EBIT row for columns 9/30/2022, 6/30/2022, 3/31/2022, 12/31/2021.

EBIT 1: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>25,484,000</span></div>

EBIT 2: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>23,785,000</span></div>

EBIT 3: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>30,830,000</span></div>

EBIT 4: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>41,935,000</span></div>

Output : dates = {9/30/2022, 9/30/2021, 9/30/2020, 9/30/2019}

datesQ = {9/30/2022, 6/30/2022, 3/31/2022, 12/31/2021}

EBIT = {122,034,000, 111,852,000, 69,964,000, 69,313,000}

EBITQ = {25,484,000, 23,785,000, 30,830,000, 41,935,000}

Where Q stands for Quarterly.

OR, it could be two hashmaps with yearlyHash = {date1: value1, date2: value2, date3: value3 and date4: value4} quarterlyHash = {date1: value1, date2: value2, date3: value3 and date4: value4}

I am wondering what's the best way to match my pattern and extract the values I want.

EDIT:

Also, unfortunately, I don't see a title = Current Liabilities in the source of the page https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL because it seems to be under an expandable row in the table under Total Liabilities , and I am not sure how the website calculates that field.

So, how can I extract that value? Also the quarterly values, what to do for that. That seems to be an AJAX call.

I would take a different approach, without using regex.
To get the ebit data, first find the div that contains the title "EBIT":

Element ebit = doc.select("[title=EBIT]").first();

This element's direct parent has siblings with all the needed information, which are the columns of the table:

Elements ebitData = ebit.parent().siblingElements();

and all the data can be extracted from the element's text -

for (Element e : ebitData) {
        System.out.println(e.text());           
}

Output:

122,034,000
122,034,000
111,852,000
69,964,000
69,313,000

You'll have to skip the 0th element and get all the other ones.
Use the same method to get all the other values and dates and then you can put them in arrays or hash map.

EDIT
In order to get the dates you can use a similar method -
Find the element that contains the text "Breakdown" (assuming it will remain constant) -
Elements dates = doc.select(":containsOwn(Breakdown)").first();
and now find its parent's siblings, so it can be written as -

Elements dates = doc.select(":containsOwn(Breakdown)").first().parent().siblingElements();
    
for (Element e : dates) {
    System.out.println(e.text());
}

As for the other values that you mention in your comments - I don't see it in my browser. If you have to open a new tab in the site or even expand a row, then probably it's sort of xhr or ajax request and jsoup cannot handle it.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM