How to extract the <a> tag from Google Scholar that contains the link to download a PDF file
I need to extract the <a> tag from the HTML of a Google Scholar results page. I've written a script, but it extracts all the <a> tags, and I can't find a way to extract the specific tag that holds the paper's download link. Please help. Below is the code:
public static void main(String[] args) throws IOException {
    Document doc;
    try {
        doc = Jsoup.connect("https://scholar.google.com.pk/scholar?q=Bergmark%2C+D.+%282000%29.+Automatic+extraction+of+reference+linking+information+from+online+documents.+Technical+Report+CSTR2000-1821%2C+Cornell+Digital+Library+Research+Group&btnG=&hl=en&as_sdt=0%2C5").get();
        String title = doc.title();
        System.out.println("title : " + title);
        // selects every <a> that has an href attribute, not just the PDF link
        Elements links = doc.select("a[href]");
        // Elements link = doc.select(".pdf");
        for (Element link : links) {
            // get the value from href attribute
            System.out.println("\nlink : " + link.attr("href"));
            System.out.println("text : " + link.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
And here is the structure of the tag I want:
<a href="https://ecommons.cornell.edu/bitstream/handle/1813/5809/2000-1821.pdf?sequence=1" data-clk="hl=en&sa=T&oi=gga&ct=gga&cd=0&ei=YBMXWYbRO8a72Ab_2o24CQ"><span class="gs_ctg2">[PDF]</span> cornell.edu</a>
Use div.gs_ggsd and a[href] as the CSS query.
Here:
div.gs_ggsd => selects all div tags that have the class name gs_ggsd
a[href] => selects all a tags that have an href attribute
Example:
try {
    Document doc = Jsoup
            .connect("https://scholar.google.com.pk/scholar?q=Bangla+Speech+Recognition")
            .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36")
            .get();
    String title = doc.title();
    System.out.println("title : " + title);
    Elements links = doc.select("div.gs_ggsd").select("a[href]");
    for (Element link : links) {
        System.out.println("\nlink : " + link.attr("href"));
        System.out.println("text : " + link.text());
    }
} catch (IOException e) {
    e.printStackTrace();
}
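If the gs_ggsd class name ever stops matching (Google's markup is not a stable API), one possible fallback is to filter the links by what they look like rather than by where they sit. Below is a minimal sketch using only the JDK. It assumes that a download link either has text starting with "[PDF]", as in the tag shown in the question, or has a URL path ending in .pdf. PdfLinkFilter and isPdfLink are hypothetical names for illustration, not part of jsoup.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class PdfLinkFilter {

    // Returns true if a link looks like a direct PDF download.
    // href and text are meant to be the values of link.attr("href") and link.text().
    // Assumption: PDF links are labelled "[PDF]" or have a path ending in ".pdf".
    public static boolean isPdfLink(String href, String text) {
        if (text != null && text.trim().startsWith("[PDF]")) {
            return true;
        }
        if (href == null) {
            return false;
        }
        try {
            // Inspect only the path so a query string such as "?sequence=1" is ignored.
            String path = new URI(href).getPath();
            return path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf");
        } catch (URISyntaxException e) {
            // An href that is not a valid URI is treated as a non-PDF link.
            return false;
        }
    }

    public static void main(String[] args) {
        String pdf = "https://ecommons.cornell.edu/bitstream/handle/1813/5809/2000-1821.pdf?sequence=1";
        System.out.println(isPdfLink(pdf, "[PDF] cornell.edu")); // true (label matches)
        System.out.println(isPdfLink(pdf, "cornell.edu"));       // true (path ends in .pdf)
        System.out.println(isPdfLink("https://scholar.google.com.pk/scholar?q=jsoup", "Cited by 12")); // false
    }
}
```

Inside the loop from the example above, you could call isPdfLink(link.attr("href"), link.text()) and print only the links for which it returns true.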
Read more: https://jsoup.org/cookbook/extracting-data/selector-syntax