How to webscrape scholar.google.com in Java?

I want to write a Java function grabTopResults(String f) such that grabTopResults("automata theory") returns a list of the top 100 cited papers on scholar.google.com for "automata theory".

Does anyone have suggestions for libraries that will make my life easy?

Thanks!

As I'm sure Google can afford the bandwidth, I'll ignore the question of whether this is immoral/illegal/prohibited by Google's T&C.

The first thing you need to do is figure out which HTTP request (or requests) you need to issue in order to obtain the page with the data you need. Once you've figured this out, use HttpClient to issue the same request from Java code. The HttpClient documentation includes example code that shows how to do this.
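As a rough sketch of that step, here is what issuing the request might look like. It uses the JDK's built-in java.net.http.HttpClient (Java 11+) rather than whichever HttpClient library the answer had in mind, so treat that as a substitution. The /scholar path, the q and start parameter names, and the 10-results-per-page figure are assumptions based on what the site appears to use; check them in your browser before relying on them. ScholarFetch, buildSearchUri, pageUris and fetch are made-up names for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ScholarFetch {
    // Assumed endpoint; confirm the real request in your browser's network tab.
    static final String SEARCH_URL = "https://scholar.google.com/scholar";

    /** Builds the search URI for one results page; start is the result offset (assumed parameter). */
    static URI buildSearchUri(String query, int start) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(SEARCH_URL + "?q=" + q + "&start=" + start);
    }

    /** Builds one URI per page, enough to cover total results at pageSize results per page. */
    static List<URI> pageUris(String query, int total, int pageSize) {
        List<URI> uris = new ArrayList<>();
        for (int start = 0; start < total; start += pageSize) {
            uris.add(buildSearchUri(query, start));
        }
        return uris;
    }

    /** Issues a GET request and returns the response body, failing on any non-200 status. */
    static String fetch(HttpClient client, URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("User-Agent", "Mozilla/5.0") // assumption: a browser-like agent may be needed
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        String html = fetch(client, buildSearchUri("automata theory", 0));
        System.out.println(html.length() + " characters downloaded");
    }
}
```

For the top 100 results, pageUris("automata theory", 100, 10) would give ten URIs to fetch in turn, assuming each page really does carry 10 results.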

Once you've downloaded the content of the relevant page, you'll need to use an HTML parser to extract the data you're interested in. The Jericho parser suggested by peperg is a good choice.

If the Google police come knocking, you've never heard of me, OK?

I use http://jericho.htmlparser.net/docs/index.html . Google Scholar doesn't have an API ( http://code.google.com/p/google-ajax-apis/issues/detail?id=109 ). Of course, Google does not allow this (read the terms of use; automated requests are forbidden).

Below is a bit of example code that gets the titles on the first page using the open source product TestPlan. It is a standalone product, but if you really need it I could help you integrate it into your Java code (it is written in Java itself).

GotoURL http://scholar.google.com/

SubmitForm with
    %Params:q% automate theory
end

set %Items% as response //div[@class='gs_r']
foreach %Item% in %Items%
    set %Title% as selectIn %Item% h3
    Notice %Title%
end

This produces output like the below (my IP is in Germany, hence the German response). Obviously you could format it however you like, or write it to a file; this is just a rough test.

00000000-00 GOTOURL http://scholar.google.com/
00000001-00 SUBMITFORM default
00000002-00 NOTICE [ZITATION] Stochastic complexity in statistical inquiry theory
00000003-00 NOTICE AUTOMATED THEORY FORMATION IN MATHEMATICS1
00000004-00 NOTICE Constraint generation via automated theory formation
00000005-00 NOTICE [BUCH] Automated theorem proving: after 25 years
00000006-00 NOTICE [BUCH] Introduction to the Theory of Computation
00000007-00 NOTICE [ZITATION] Computer-controlled systems: theory and design
00000008-00 NOTICE [BUCH] … , randomness & incompleteness: papers on algorithmic information theory
00000009-00 NOTICE [BUCH] Automatic control systems
00000010-00 NOTICE [BUCH] VLSI physical design automation: theory and practice
00000011-00 NOTICE Singular Control Systems.
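
If you would rather do the same selection directly in Java, the XPath from the TestPlan script can be tried with the JDK's javax.xml.xpath classes. This is only a sketch: it assumes the page has already been turned into well-formed XHTML, which a raw Scholar page is not guaranteed to be (a lenient parser such as Jericho would be needed for real pages). It also assumes the gs_r class name from the script still matches the live markup, and that matching any h3 below each result div is close enough to what selectIn does. TitleExtractor and extractTitles are made-up names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TitleExtractor {
    /** Returns the text of every h3 found inside a div whose class is exactly 'gs_r'. */
    static List<String> extractTitles(String xhtml) {
        try {
            // Requires well-formed markup; raw HTML will usually fail to parse here.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList headings = (NodeList) xpath.evaluate(
                    "//div[@class='gs_r']//h3", doc, XPathConstants.NODESET);
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < headings.getLength(); i++) {
                titles.add(headings.item(i).getTextContent().trim());
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample markup, not a real Scholar response.
        String sample = "<html><body>"
                + "<div class='gs_r'><h3><a href='#'>Introduction to the Theory of Computation</a></h3></div>"
                + "<div class='gs_r'><h3>Automatic control systems</h3></div>"
                + "<div class='other'><h3>Not a result</h3></div>"
                + "</body></html>";
        extractTitles(sample).forEach(System.out::println);
    }
}
```

Run against the sample, this prints the two titles inside the gs_r divs and skips the third heading.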
