How to extract information from Google Scholar that contains a link to download a PDF file

Question

I need to extract a specific <a> tag from the HTML of a Google Scholar results page. I've written a script, but it extracts all the <a> tags, and I can't find a way to extract only the tag that holds the paper's download link. Please help! Below is the code:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is a placeholder; the original post showed only main().
public class ScholarLinks {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://scholar.google.com.pk/scholar?q=Bergmark%2C+D.+%282000%29.+Automatic+extraction+of+reference+linking+information+from+online+documents.+Technical+Report+CSTR2000-1821%2C+Cornell+Digital+Library+Research+Group&btnG=&hl=en&as_sdt=0%2C5").get();

            String title = doc.title();
            System.out.println("title : " + title);

            // Matches every anchor with an href, not only the PDF download link
            Elements links = doc.select("a[href]");
            // Elements link = doc.select(".pdf");
            for (Element link : links) {
                // get the value from href attribute
                System.out.println("\nlink : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This is the structure of the tag I want:

<a href="https://ecommons.cornell.edu/bitstream/handle/1813/5809/2000-1821.pdf?sequence=1" data-clk="hl=en&amp;sa=T&amp;oi=gga&amp;ct=gga&amp;cd=0&amp;ei=YBMXWYbRO8a72Ab_2o24CQ"><span class="gs_ctg2">[PDF]</span> cornell.edu</a>

Answer 1

Use div.gs_ggsd and a[href] as the CSS query.

Here:

div.gs_ggsd => selects all div tags that have the class name gs_ggsd
a[href] => selects all a tags that have an href attribute

Example:

try {
    Document doc = Jsoup
            .connect("https://scholar.google.com.pk/scholar?q=Bangla+Speech+Recognition")
            .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36")
            .get();

    String title = doc.title();
    System.out.println("title : " + title);

    Elements links = doc.select("div.gs_ggsd").select("a[href]");

    for (Element link : links) {
        System.out.println("\nlink : " + link.attr("href"));
        System.out.println("text : " + link.text());
    }

} catch (IOException e) {
    e.printStackTrace();
}

Read more: https://jsoup.org/cookbook/extracting-data/selector-syntax
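
The selector can also be tried offline, without a request to Google Scholar. The sketch below is only an illustration: the class name PdfLinkDemo, the method pdfLinks, and the sample markup are hypothetical. The markup assumes the [PDF] anchor from the question sits inside a div with class gs_ggsd, as the answer's query implies, and it assumes jsoup is on the classpath. It uses the combined descendant selector "div.gs_ggsd a[href]", which should match the same elements as the two chained select calls.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PdfLinkDemo {

    // Hypothetical, simplified markup: one ordinary link plus the [PDF] anchor
    // from the question, wrapped in div.gs_ggsd (an assumption about the page).
    static final String SAMPLE_HTML =
            "<div class='gs_r'>"
          + "<a href='https://example.org/abstract'>Result title</a>"
          + "<div class='gs_ggsd'>"
          + "<a href='https://ecommons.cornell.edu/bitstream/handle/1813/5809/2000-1821.pdf?sequence=1'>"
          + "<span class='gs_ctg2'>[PDF]</span> cornell.edu</a>"
          + "</div>"
          + "</div>";

    // Returns the href of every anchor nested inside a div with class gs_ggsd.
    static List<String> pdfLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("div.gs_ggsd a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        for (String href : pdfLinks(SAMPLE_HTML)) {
            System.out.println("link : " + href);
        }
    }
}
```

With this sample input, only the cornell.edu PDF link is printed and the ordinary link is skipped. If the wrapper class on the live page changes, selecting on the [PDF] label instead, for example "a:has(span.gs_ctg2)", might be an alternative, though that is untested against the real page.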

Answer 1: 0 votes, accepted, 2017-05-14 17:40:55
