Is it possible to scrape all Google Scholar results on a particular topic, and is it legal?

I have some R experience, but none with website coding, and I believe I was not able to select the correct CSS nodes to parse.

library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)

url <-'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
webpage <- read_html(url)

title_html <- html_nodes(webpage, 'a#rh06x-YUUvEJ')
title <- html_text(title_html)
head(title)

Ultimately, if I could scrape all the Scholar results and split them into a CSV file with headers like 'Title', 'Author', 'Year', 'Journal', that would be great. Any help would be much appreciated! Thanks

Concerning your code, you almost had it; you just did not select the proper element. I believe you selected by id, whereas I found html_nodes works best when selecting by class. The classes you are looking for are gs_rt (titles) and gs_a (the author, journal and year line).

With regex you can then process the data into the desired format by extracting authors and years.

url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, stringsAsFactors = FALSE)
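
To get closer to the CSV asked for in the question, the gs_a text can be split a bit further. This is only a minimal sketch, assuming each gs_a string follows the pattern "authors - journal, year - publisher". The sample strings, titles and the file name scholar_results.csv are made up for illustration (they are not scraped data), and real entries that deviate from the pattern (no journal, non-breaking spaces around the dashes) would need extra handling.

```r
# Hypothetical examples of text found in '.gs_a' nodes (not real scraped data)
authors_years <- c(
  "A Author, B Writer - Journal of Examples, 2014 - example.org",
  "C Person - Example Letters, 2009 - example.com"
)
titles <- c("First example title", "Second example title")

# Split each string on the " - " separators
parts <- strsplit(authors_years, "\\s+-\\s+")
authors <- vapply(parts, function(p) p[1], character(1))
middle <- vapply(
  parts,
  function(p) if (length(p) >= 2) p[2] else NA_character_,
  character(1)
)

# The year is assumed to be a 4-digit number at the end of the middle part;
# the journal is whatever precedes it
has_year <- grepl("\\d{4}\\s*$", middle)
years <- ifelse(
  has_year,
  sub("^.*?(\\d{4})\\s*$", "\\1", middle, perl = TRUE),
  NA_character_
)
journals <- sub(",?\\s*\\d{4}\\s*$", "", middle)

# Assemble the table with the headers requested in the question and save it
df <- data.frame(
  Title = titles, Author = authors, Year = years, Journal = journals,
  stringsAsFactors = FALSE
)
write.csv(df, "scholar_results.csv", row.names = FALSE)
```

With scraped data, titles and authors_years would come from the html_nodes calls above instead of the hard-coded vectors.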

Thanks Nico, that was really helpful. However, it only scrapes data from the first page of results, and the maximum number of results per page (as per the Google Scholar settings) is 20. Is there any way to scrape data from all result pages? Thank you so much!
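
One possible way to go beyond the first page is to request each results page in turn. The sketch below assumes that Google Scholar pages its results through a start query parameter (the offset of the first result, in steps of the page size), which is how the "next page" links appear to work. scholar_page_urls and scrape_page are hypothetical helper names, not part of rvest, and the 5-second pause and the page count are guesses rather than known safe limits. Google Scholar may block automated requests or answer with a CAPTCHA, so it is worth checking Google's terms of service before running the loop.

```r
# Build one URL per results page; 'start' is assumed to be the result offset
scholar_page_urls <- function(query, n_pages, per_page = 10L) {
  starts <- as.integer((seq_len(n_pages) - 1L) * per_page)
  paste0(
    "https://scholar.google.com/scholar?hl=en&q=",
    utils::URLencode(query, reserved = TRUE),
    "&start=", starts
  )
}

# Scrape titles and author/year lines from a single page (same selectors as above)
scrape_page <- function(page_url) {
  wp <- xml2::read_html(page_url)
  data.frame(
    titles = rvest::html_text(rvest::html_nodes(wp, ".gs_rt")),
    authors_years = rvest::html_text(rvest::html_nodes(wp, ".gs_a")),
    stringsAsFactors = FALSE
  )
}

urls <- scholar_page_urls("apex predator conservation", n_pages = 3)

# The network loop is switched off by default; set to TRUE to actually fetch
if (FALSE) {
  pages <- lapply(urls, function(u) {
    Sys.sleep(5)  # pause between requests (arbitrary value)
    scrape_page(u)
  })
  all_results <- do.call(rbind, pages)
}
```

The combined all_results data frame can then go through the same regex processing as the single-page example.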
