Is it possible to scrape all Google Scholar results on a particular topic, and is it legal?
I have some R experience, but not with website coding, and I believe I was not able to select the correct CSS nodes to parse.
library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)
url <-'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
webpage <- read_html(url)
title_html <- html_nodes(webpage, 'a#rh06x-YUUvEJ')
title <- html_text(title_html)
head(title)
Ultimately, if I could scrape all Scholar results and divide them into a CSV file with headers like 'Title', 'Author', 'Year', 'Journal', that would be great. Any help would be much appreciated!
Thanks
Concerning your code, you almost had it - you did not select the proper element. I believe you selected by id, whereas I found html_nodes works best when selecting by class. The classes you are looking for are gs_rt and gs_a.
With regex you can then process the data into the desired format by extracting authors and years.
url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, stringsAsFactors = FALSE)
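Since the goal is a CSV file, the data frame above can be written out directly with base R's write.csv. A minimal sketch (a small stand-in data frame is used here so the snippet runs on its own; extracting the journal from the gs_a text is left out, since that field's layout varies between results):

```r
# `df` stands in for the data frame built from the scraped titles/authors/years.
df <- data.frame(
  titles  = c("Example title"),
  authors = c("A Author, B Author"),
  years   = c("2019"),
  stringsAsFactors = FALSE
)

# Write one row per result; row.names = FALSE drops the numeric row index.
write.csv(df, "scholar_results.csv", row.names = FALSE)
```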
Thanks Nico, that was really helpful. However, it only scrapes data from the first page of results, and the maximum number of results per page (per Google Scholar's settings) is 20. Is there any way to scrape data from all result pages? Thank you so much!
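Google Scholar paginates through the start query parameter (start=0, 10, 20, ...). A sketch of looping over the first few pages, assuming the .gs_rt / .gs_a selectors from the answer above still match; note that Google Scholar's terms of service prohibit automated scraping and rapid requests will trigger a block or CAPTCHA, so this is illustrative only:

```r
library(rvest)
library(xml2)

base_url <- 'https://scholar.google.com/scholar?hl=en&q=apex+predator+conservation'
all_titles <- character(0)
all_meta   <- character(0)

# First five pages at the default 10 results per page.
for (start in seq(0, 40, by = 10)) {
  page_url <- paste0(base_url, '&start=', start)
  wp <- xml2::read_html(page_url)
  all_titles <- c(all_titles, rvest::html_text(rvest::html_nodes(wp, '.gs_rt')))
  all_meta   <- c(all_meta,   rvest::html_text(rvest::html_nodes(wp, '.gs_a')))
  Sys.sleep(5)  # be polite between requests
}
```

The accumulated vectors can then be processed with the same regex steps shown above and written to CSV.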