Is it possible to scrape all Google Scholar search results for a specific topic, and is it legal?

Question

I have some R experience, but none with website coding, and I think I was not able to select the correct CSS nodes to parse.

library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)

url <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
webpage <- read_html(url)

title_html <- html_nodes(webpage, 'a#rh06x-YUUvEJ')
title <- html_text(title_html)
head(title)

Ultimately, I would like to scrape all Scholar results and write them to a CSV file with headers like 'Title', 'Author', 'Year', 'Journal'. Any help would be much appreciated! Thanks

Answer 1

Concerning your code, you almost had it; you just did not select the proper element. I believe you selected by id, whereas I have found that html_nodes works best when selecting by class. The classes you are looking for are gs_rt and gs_a.

With regex you can then process the data into the desired format by extracting authors and years.

url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, stringsAsFactors = FALSE)
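
To get closer to the CSV layout asked for in the question, the same two patterns can be tried offline on sample strings. This is only a sketch: the strings below are made up to resemble the text found in gs_a, the journal column assumes the parts are separated by a plain " - " (the real page may use different whitespace around the dash), and the output goes to a temporary file.

```r
# Made-up strings shaped like the text of '.gs_a' nodes (not real search results)
authors_years <- c(
  "A Author, B Writer - Journal of Examples, 2014 - example.org",
  "C Person - Example Letters, 2009 - example.com"
)
titles <- c("First made-up title", "Second made-up title")

# Same patterns as in the answer above
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)

# Assumption: the journal is the middle " - " separated part, minus a trailing year
parts <- strsplit(authors_years, " - ", fixed = TRUE)
middle <- vapply(
  parts,
  function(p) if (length(p) >= 2) p[2] else NA_character_,
  character(1)
)
journal <- sub(",?\\s*\\d{4}$", "", middle)

# Assemble the columns under the headers requested in the question
df <- data.frame(
  Title = titles,
  Author = authors,
  Year = as.integer(years),
  Journal = journal,
  stringsAsFactors = FALSE
)

# Write to a temporary file; replace with a real path as needed
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
```

Entries that lack a journal or a year will not fit these patterns cleanly, so it is worth inspecting the data frame before relying on the CSV.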

Answer 2

Thanks Nico, that was really helpful. However, it only scrapes data from the first page of results, and the maximum number of results per page (as per the Google Scholar settings) is 20. Is there any way to scrape data from all result pages? Thank you so much!
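
One possible way to reach later pages, sketched under assumptions rather than confirmed anywhere in this thread: the result pages appear to be addressed with a start query parameter (0, 10, 20, ...), so one URL per page can be built and the extraction from Answer 1 applied to each. The start parameter, the page size of 10, and the helper name scrape_page are assumptions for illustration. Google Scholar may block or CAPTCHA automated requests, so the loop is switched off by default and pauses between requests.

```r
# Assumption: result pages are selected with a 'start' offset in the query string
base_url <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation'
per_page <- 10   # assumed page size
n_pages <- 3     # number of pages to request

# Offsets 0, 10, 20 for three pages
starts <- (seq_len(n_pages) - 1) * per_page
page_urls <- paste0(base_url, '&start=', starts)

# Hypothetical helper: same class selectors as in Answer 1, applied to one page
scrape_page <- function(url) {
  wp <- xml2::read_html(url)
  data.frame(
    titles = rvest::html_text(rvest::html_nodes(wp, '.gs_rt')),
    authors_years = rvest::html_text(rvest::html_nodes(wp, '.gs_a')),
    stringsAsFactors = FALSE
  )
}

# Switched off by default so that nothing is requested until you opt in
run_requests <- FALSE
if (run_requests) {
  pages <- lapply(page_urls, function(u) {
    Sys.sleep(5)  # pause between requests
    scrape_page(u)
  })
  all_results <- do.call(rbind, pages)
}
```

Set run_requests to TRUE to actually fetch the pages; all_results then holds one row per result, ready for the regex processing shown in Answer 1.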


Answer 1
0, accepted, 2019-10-01 21:11:56

Answer 2
-2, 2023-01-20 06:30:02
