Retrieve the number of Google Scholar search results by year using R or Python?

Question

I don't know where to start, so I have no attempted code to show, and I apologize. Is there a way to loop the following URL over a sequence of numbers (years):

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C22&as_ylo=2021&q=%22TERM1%22+AND+%22TERM2%22&btnG=

where 2021 is replaced by each year in the sequence, so that I just get the number of search results per year?

Thank you so much!

Edit:

This works for Google Search but not for Google Scholar, where it returns an empty node set.

library(RCurl)  # getURL()
library(XML)    # htmlTreeParse(), getNodeSet()

ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
url <- "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C22&as_ylo=2021&q=%22causal+inference%22+AND+%22statistics%22&btnG="
doc <- htmlTreeParse(getURL(url, httpheader = list(`User-Agent` = ua)), useInternalNodes = TRUE)

# Returns an empty node set for the Scholar page
nodes <- getNodeSet(doc, "//div[@id='result-stats']")
nodes

Answer 1

An approximate results count appears below the search bar. Many of the attribute values look dynamically generated, so, based on experience, I would rely on relationships between more stable elements and attributes instead. In this case, I would use :contains() to find a div whose text includes "results", and anchor that div with a CSS selector list describing where it is expected to sit relative to the search bar and the elements in between.

library(rvest)
library(httr)

# Send a browser-like User-Agent header with the request
headers <- c("User-Agent" = "Safari/537.36")
r <- httr::GET(url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C22&as_ylo=2021&q=%22TERM1%22+AND+%22TERM2%22&btnG=",
               httr::add_headers(.headers = headers))

# Find the div containing "results", anchored relative to the search form
r |> content() |> html_element('form[method=post] + div div > div:contains("results")') |> html_text()

You can then perhaps use a simple regex to extract the result count, e.g.:

library(stringr)

r |>
  content() |>
  html_element('form[method=post] + div div > div:contains("results")') |>
  html_text() |>
  str_extract("\\d[\\d,]*") |>  # first number, keeping thousands separators
  str_remove_all(",") |>        # "1,230" -> "1230"
  as.integer()
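
For the looping part of the question, here is a minimal Python sketch of the same idea, under a few assumptions. The names build_url, parse_count and counts_by_year are hypothetical helpers, not part of any library. The as_yhi parameter is not in the original URL; it is added on the assumption that it sets the upper bound of the year range, so each request covers a single year rather than "since that year". The fetch argument is any function you supply that takes a URL and returns the results text, for example the string extracted by the selector above.

```python
import re
import time
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"


def build_url(term1, term2, year):
    """Build a Scholar search URL restricted to one year (hypothetical helper)."""
    params = {
        "hl": "en",
        "as_sdt": "0,22",
        "as_ylo": year,  # lower bound of the year range, as in the question
        "as_yhi": year,  # upper bound; an assumption, not in the original URL
        "q": f'"{term1}" AND "{term2}"',
        "btnG": "",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def parse_count(text):
    """Pull the count out of text like 'About 1,230 results (0.05 sec)'."""
    match = re.search(r"(\d[\d,]*)\s+results?", text)
    if match is None:
        return None  # no count found in the text
    return int(match.group(1).replace(",", ""))


def counts_by_year(term1, term2, years, fetch, delay=5.0):
    """Call fetch(url) once per year and collect the parsed counts."""
    counts = {}
    for year in years:
        text = fetch(build_url(term1, term2, year))
        counts[year] = parse_count(text)
        time.sleep(delay)  # pause between requests; the default is arbitrary
    return counts
```

A call might look like counts_by_year("causal inference", "statistics", range(2015, 2022), fetch), which returns a dict mapping each year to its count, or to None where no count could be parsed.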

Solution 1
1 2022-05-19 04:51:17
