
How to extract the <a> tag from Google Scholar that contains the link to download a PDF file

I need to extract the <a> tag from the HTML of Google Scholar. I've written the script, but it extracts all the <a> tags, and I can't find any way to extract the specific tag that holds the download link of the paper. Please help! Below is the code:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is illustrative; the original snippet showed only main()
public class ScholarLinks {

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://scholar.google.com.pk/scholar?q=Bergmark%2C+D.+%282000%29.+Automatic+extraction+of+reference+linking+information+from+online+documents.+Technical+Report+CSTR2000-1821%2C+Cornell+Digital+Library+Research+Group&btnG=&hl=en&as_sdt=0%2C5").get();

            String title = doc.title();
            System.out.println("title : " + title);

            // selects every <a> element that has an href attribute
            Elements links = doc.select("a[href]");
            // Elements link = doc.select(".pdf");
            for (Element link : links) {
                // get the value from href attribute
                System.out.println("\nlink : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

And here is the structure of this tag:

<a href="https://ecommons.cornell.edu/bitstream/handle/1813/5809/2000-1821.pdf?sequence=1" data-clk="hl=en&amp;sa=T&amp;oi=gga&amp;ct=gga&amp;cd=0&amp;ei=YBMXWYbRO8a72Ab_2o24CQ"><span class="gs_ctg2">[PDF]</span> cornell.edu</a>

Use div.gs_ggsd and a[href] as the CSS query.

Here:

div.gs_ggsd => selects all the div tags that have the class name gs_ggsd

a[href] => selects all the a tags that have an href attribute

Example:

try {
    Document doc = Jsoup
            .connect("https://scholar.google.com.pk/scholar?q=Bangla+Speech+Recognition")
            .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36")
            .get();

    String title = doc.title();
    System.out.println("title : " + title);

    Elements links = doc.select("div.gs_ggsd").select("a[href]");

    for (Element link : links) {
        System.out.println("\nlink : " + link.attr("href"));
        System.out.println("text : " + link.text());
    }

} catch (IOException e) {
    e.printStackTrace();
}
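The class name gs_ggsd comes from Google Scholar's current markup and may change, so it can help to also check the href itself. The following is a minimal sketch of such a check using only the Java standard library. It assumes that a download link is one whose URL path ends in .pdf, as in the tag shown in the question. The names PdfLinkFilter, isPdfLink and keepPdfLinks are hypothetical and not part of jsoup, and the second and third sample URLs are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of jsoup: keeps only hrefs whose path ends in ".pdf".
public class PdfLinkFilter {

    // Returns true when the path component of the URL ends in ".pdf".
    // A query string such as "?sequence=1" is ignored, because URI.getPath() excludes it.
    public static boolean isPdfLink(String href) {
        if (href == null || href.isEmpty()) {
            return false;
        }
        try {
            String path = new URI(href).getPath();
            return path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf");
        } catch (URISyntaxException e) {
            // A malformed href is treated as "not a PDF link" instead of stopping the scan.
            return false;
        }
    }

    // Filters a list of href values, preserving their order.
    public static List<String> keepPdfLinks(List<String> hrefs) {
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            if (isPdfLink(href)) {
                result.add(href);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of(
                // href taken from the tag shown in the question
                "https://ecommons.cornell.edu/bitstream/handle/1813/5809/2000-1821.pdf?sequence=1",
                // made-up examples of links that should be skipped
                "/scholar?cites=123",
                "https://example.org/paper.html");

        for (String href : keepPdfLinks(hrefs)) {
            System.out.println("pdf link : " + href);
        }
    }
}
```

Inside the loop of the example above, this could be used as if (PdfLinkFilter.isPdfLink(link.attr("href"))) { ... }. Note that this check misses PDF links whose URL does not end in .pdf, so it complements the class-based selector and does not replace it.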

Read more: https://jsoup.org/cookbook/extracting-data/selector-syntax
