
How to extract the <a> tag from Google Scholar that contains the link to download a PDF file

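I need to extract an <a> tag from the HTML of Google Scholar. I have written a script, but it extracts every <a> tag, and I cannot find a way to extract only the specific tag that holds the download link for the file. Please help! The code is below: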

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // The class name is illustrative; the original snippet showed only main().
    public class ScholarLinks {
        public static void main(String[] args) {
            try {
                Document doc = Jsoup.connect("https://scholar.google.com.pk/scholar?q=Bergmark%2C+D.+%282000%29.+Automatic+extraction+of+reference+linking+information+from+online+documents.+Technical+Report+CSTR2000-1821%2C+Cornell+Digital+Library+Research+Group&btnG=&hl=en&as_sdt=0%2C5").get();

                String title = doc.title();
                System.out.println("title : " + title);

                // Matches every <a> with an href attribute, so all links on the page are printed
                Elements links = doc.select("a[href]");
                // Elements link = doc.select(".pdf");
                for (Element link : links) {
                    // get the value from href attribute
                    System.out.println("\nlink : " + link.attr("href"));
                    System.out.println("text : " + link.text());
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

This is the structure of the tag:

<a href="https://ecommons.cornell.edu/bitstream/handle/1813/5809/2000-1821.pdf?sequence=1" data-clk="hl=en&amp;sa=T&amp;oi=gga&amp;ct=gga&amp;cd=0&amp;ei=YBMXWYbRO8a72Ab_2o24CQ"><span class="gs_ctg2">[PDF]</span> cornell.edu</a>

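Use div.gs_ggsd a[href] as the CSS query.

Here:

div.gs_ggsd => selects every div element that has the class name gs_ggsd
a[href] => selects every a element that has an href attribute

The space between the two parts is the descendant combinator, so the query matches only links nested inside those div elements.

Example: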

try {
    Document doc = Jsoup
            .connect("https://scholar.google.com.pk/scholar?q=Bangla+Speech+Recognition")
            .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36")
            .get();

    String title = doc.title();
    System.out.println("title : " + title);

    Elements links = doc.select("div.gs_ggsd").select("a[href]");

    for (Element link : links) {
        System.out.println("\nlink : " + link.attr("href"));
        System.out.println("text : " + link.text());
    }

} catch (IOException e) {
    e.printStackTrace();
}
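The selector above depends on the gs_ggsd class name, which is set by Google and may change. As a complementary check, the hrefs collected from jsoup (for example via link.attr("href")) could also be filtered by whether the URL path ends in .pdf. The following is only a minimal sketch under that assumption: PdfLinkFilter, isPdfLink and pdfLinks are hypothetical names, not part of jsoup, and the sample hrefs other than the cornell.edu one are made up. It uses java.net.URI so that a query string such as ?sequence=1 does not hide the .pdf extension.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of jsoup): keeps only hrefs whose URL path ends in ".pdf".
class PdfLinkFilter {

    // Returns true when the path component, excluding query string and fragment, ends with ".pdf".
    static boolean isPdfLink(String href) {
        try {
            String path = new URI(href).getPath(); // null for opaque URIs such as "javascript:..."
            return path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf");
        } catch (URISyntaxException e) {
            return false; // malformed hrefs are skipped
        }
    }

    // Returns the hrefs that look like PDF links, preserving their original order.
    static List<String> pdfLinks(List<String> hrefs) {
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            if (isPdfLink(href)) {
                result.add(href);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The first href is the one from the question; the others are invented for illustration.
        List<String> hrefs = List.of(
                "https://ecommons.cornell.edu/bitstream/handle/1813/5809/2000-1821.pdf?sequence=1",
                "/scholar?q=related",
                "https://example.org/paper.PDF");
        for (String href : pdfLinks(hrefs)) {
            System.out.println("pdf link : " + href);
        }
    }
}
```

This check only looks at the URL, so it would miss PDF downloads served from paths without a .pdf extension; it is meant as a fallback alongside the CSS query, not a replacement for it.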

Read more: https://jsoup.org/cookbook/extracting-data/selector-syntax

