

How to scrape just four numeric values from an HTML web page's table in Java for Android?

Here's my current code:

private static String fetch(String url) throws MalformedURLException, IOException, UnsupportedEncodingException {
    String userAgent1 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/56.0.3051.43";
    try {
        Document doc1 = Jsoup.connect(url).userAgent(userAgent1).get();
        Elements divTags = doc1.getElementsByTag("div");
        String re = "^<div class=\\\"Ta\\(c\\) Py\\(6px\\) Bxz\\(bb\\) BdB Bdc\\(\\$seperatorColor\\) Miw\\(120px\\) Miw\\(100px\\)\\-\\-pnclg D\\(tbc\\)\\\" data-test=\\\"fin-col\\\"><span>.*</span></div>$";

        for (Element div : divTags) {
            Pattern pattern = Pattern.compile(re, Pattern.DOTALL);
            Matcher matcher = pattern.matcher(div.html());

            if (matcher.find()) {
                String data = matcher.group(1);
                Log.d("Matched: ", data);
            } else {
                Log.d("Nothing Matched: ", "");
            }
        }
    } catch (Exception e) {
        Log.e("err-new", "err", e);
    }
    return "";
}

This function takes a URL as input, in our case https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2, and extracts all the div tags using Jsoup.

I then need to extract these values using pattern matching. But with my code above, all I get is "Nothing Matched: ".

Here's the web page from which I want to get the four numeric values in the first four yearly columns of the row named EBIT (which stands for Earnings Before Interest and Taxes).

Link: https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2

Input: Looking to get values 122,034,000, 111,852,000, 69,964,000, 69,313,000 on the EBIT row for columns 9/30/2022, 9/30/2021, 9/30/2020, 9/30/2019.

On Inspect, these values are under the following <div> tags.

EBIT 1: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>122,034,000</span></div>

EBIT 2: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>111,852,000</span></div>

EBIT 3: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>69,964,000</span></div>

EBIT 4: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>69,313,000</span></div>

And the same thing for the 4 columns under the Quarterly tab on the same web page. Looking to get values 25,484,000, 23,785,000, 30,830,000, 41,935,000 on the EBIT row for columns 9/30/2022, 6/30/2022, 3/31/2022, 12/31/2021.

EBIT 1: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>25,484,000</span></div>

EBIT 2: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>23,785,000</span></div>

EBIT 3: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>30,830,000</span></div>

EBIT 4: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>41,935,000</span></div>

Output: dates = {9/30/2022, 9/30/2021, 9/30/2020, 9/30/2019}

datesQ = {9/30/2022, 6/30/2022, 3/31/2022, 12/31/2021}

EBIT = {122,034,000, 111,852,000, 69,964,000, 69,313,000}

EBITQ = {25,484,000, 23,785,000, 30,830,000, 41,935,000}

Where Q stands for Quarterly.

Or, it could be two hashmaps: yearlyHash = {date1: value1, date2: value2, date3: value3, date4: value4} and quarterlyHash = {date1: value1, date2: value2, date3: value3, date4: value4}.
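
For illustration, that target shape could look something like the following sketch (using java.util.LinkedHashMap so the columns keep their order; the keys and values are just the sample figures above):

Map<String, String> yearlyHash = new LinkedHashMap<>();
yearlyHash.put("9/30/2022", "122,034,000");
yearlyHash.put("9/30/2021", "111,852,000");
yearlyHash.put("9/30/2020", "69,964,000");
yearlyHash.put("9/30/2019", "69,313,000");

Map<String, String> quarterlyHash = new LinkedHashMap<>();
quarterlyHash.put("9/30/2022", "25,484,000");
quarterlyHash.put("6/30/2022", "23,785,000");
quarterlyHash.put("3/31/2022", "30,830,000");
quarterlyHash.put("12/31/2021", "41,935,000");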

My existing code is broken. Basically, I've used Jsoup to get all the JavaScript-related tags and a pattern matcher to pull out the String values I wanted. However, on the page I'm parsing now, some values in those tags appear to be encrypted strings that can't be parsed anymore.

My use case is not that complex, as you can imagine. I just need the dates and the 4 values corresponding to that one row. Even if it's a non-standard, non-optimized solution, I am fine with that.

Thank you.

Annoyingly, the annual data is on the page as loaded, while the quarterly data is loaded with an AJAX call triggered by clicking the "Quarterly" button. Anyway, the following code will do the job:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.text.NumberFormat;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.google.gson.Gson;

public class App {
    private static final String PAGE_URL = "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2";
    private static final String DATA_URL = "https://query1.finance.yahoo.com/ws/fundamentals-timeseries/v1/finance/timeseries/AAPL?lang=en-US&region=US&symbol=AAPL&padTimeSeries=true&type=quarterlyEBIT&merge=false&period1=493590046&period2=1674660504&corsDomain=finance.yahoo.com";

    private static final String REGEX_YAHOO_PAGE_EBIT = "^.*ttm</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?EBIT</span></div><div.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*$";
    private static final Pattern PATTERN_YAHOO_PAGE_REGEX = Pattern.compile(REGEX_YAHOO_PAGE_EBIT, Pattern.DOTALL);

    private static final Gson GSON = new Gson();

    private static final NumberFormat NUMBER_FORMAT = NumberFormat.getInstance(new Locale("en", "US"));

    public static void main(String[] args) throws IOException {
        String pageContent = fetch(PAGE_URL);
        Matcher m = PATTERN_YAHOO_PAGE_REGEX.matcher(pageContent);
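        // Groups 1-4 are the four annual date headers that follow the "ttm" column;
        // group 5 is the ttm EBIT figure (ignored) and groups 6-9 are the four annual EBIT values.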
        if (m.matches()) {
            System.out.println("Annual values");

            System.out.println(m.group(1) + ": " + m.group(6));
            System.out.println(m.group(2) + ": " + m.group(7));
            System.out.println(m.group(3) + ": " + m.group(8));
            System.out.println(m.group(4) + ": " + m.group(9));
        }

        // the quarterly data is not on the page. it is rendered dynamically from this
        // AJAX call
        String quarterlyData = fetch(DATA_URL);
        System.out.println("Quarterly values");
        Map map = GSON.fromJson(quarterlyData, Map.class);
        List<Map> result = (List<Map>) ((Map) map.get("timeseries")).get("result");
        for (Map entry : result) {
            Map meta = (Map) entry.get("meta");
            if (((List<String>) meta.get("type")).get(0).equals("quarterlyEBIT")) {
                List<Map<String, Object>> quarterlyEBIT = (List) entry.get("quarterlyEBIT");
                for (Map<String, Object> cell : quarterlyEBIT) {
                    System.out.print(cell.get("asOfDate") + ": ");
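                    // reportedValue.raw holds the full dollar amount; format it with
                    // thousands separators and drop the trailing ",000" so it matches
                    // the page's figures, which are reported in thousands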
                    String fullNumberString = NUMBER_FORMAT
                            .format(((Map<String, Double>) cell.get("reportedValue")).get("raw"));
                    System.out.println(fullNumberString.substring(0, fullNumberString.length() - 4));

                }

            }
        }

    }

    private static String fetch(String url) throws MalformedURLException, IOException, UnsupportedEncodingException {
        URL pageUrl = new URL(url);
        HttpURLConnection pageConnection = (HttpURLConnection) pageUrl.openConnection();
        try {
            InputStream inputStream = new BufferedInputStream(pageConnection.getInputStream());
            int bufferSize = 1024;
            char[] buffer = new char[bufferSize];
            StringBuilder out = new StringBuilder();
            Reader in = new InputStreamReader(inputStream, "UTF-8");
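            // read the response body in 1024-char chunks and accumulate it into the builder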
            for (int numRead; (numRead = in.read(buffer, 0, buffer.length)) > 0;) {
                out.append(buffer, 0, numRead);
            }
            return out.toString();
        } finally {
            pageConnection.disconnect();
        }
    }
}

Output:

Annual values
9/30/2022: 122,034,000
9/30/2021: 111,852,000
9/30/2020: 69,964,000
9/30/2019: 69,313,000
Quarterly values
2021-12-31: 41,935,000
2022-03-31: 30,830,000
2022-06-30: 23,785,000
2022-09-30: 25,484,000
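
For reference, DATA_URL above points at Yahoo's fundamentals-timeseries endpoint with type=quarterlyEBIT, and period1/period2 look like Unix epoch seconds bounding the date range. As a sketch (assumptions: the timestamps really are epoch seconds, the same query layout works for other tickers, and the endpoint is undocumented so it may change), the URL could be built for an arbitrary symbol like this:

import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class DataUrlBuilder {
    // builds a quarterlyEBIT timeseries URL covering roughly the last five years (sketch only)
    static String quarterlyEbitUrl(String symbol) {
        long period2 = Instant.now().getEpochSecond();
        long period1 = Instant.now().minus(5 * 365, ChronoUnit.DAYS).getEpochSecond();
        return "https://query1.finance.yahoo.com/ws/fundamentals-timeseries/v1/finance/timeseries/"
                + symbol + "?lang=en-US&region=US&symbol=" + symbol
                + "&padTimeSeries=true&type=quarterlyEBIT&merge=false"
                + "&period1=" + period1 + "&period2=" + period2
                + "&corsDomain=finance.yahoo.com";
    }
}

For example, fetch(DataUrlBuilder.quarterlyEbitUrl("MSFT")) would request the same quarterly series for another ticker, assuming the endpoint accepts other symbols the same way.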

If you prefer Apache HttpClient (v4 here) then fetch() can be coded as follows:

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

    private static String fetch(String url) throws MalformedURLException, IOException, UnsupportedEncodingException {
        CloseableHttpClient httpclient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(url);
        CloseableHttpResponse response = httpclient.execute(httpGet);
        try {
            HttpEntity entity = response.getEntity();
            return EntityUtils.toString(entity);
        } finally {
            response.close();
        }
    }

I think I answered a question like yours a couple of days ago, take a look here.

I guess you can use a regular expression to match the div tags.

Please change your code to select the span element and extract the text inside it. Note that a class name containing parentheses, such as Ta(c), has to be matched with an attribute selector rather than a .class selector.

ex:

Elements spans = doc1.select("div[class*=Ta(c)] span");
for (Element span : spans) {
    String data = span.text();
    Log.d("Matched: ", data);
}

Also, you might use Jsoup's Elements class to first select the matching divs and then extract the span elements inside them.

Elements divs = doc1.select("div[class*=Ta(c)]");
Elements spanElements = divs.select("span");
for (Element span : spanElements) {
    String data = span.text();
    Log.d("Matched: ", data);
}

Using CSS selectors is also possible.
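
For instance, here is a minimal sketch that keys off the data-test="fin-col" attribute shown in the question's markup instead of the long class string (untested against the live page, and it picks up every fin-col cell rather than just the EBIT row, so you would still need to pick out the EBIT values; url and userAgent1 are the variables from the question's fetch method):

Document doc1 = Jsoup.connect(url).userAgent(userAgent1).get();
for (Element cell : doc1.select("div[data-test=fin-col] > span")) {
    Log.d("fin-col cell: ", cell.text());
}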
