Scraping a website with rvest: "Current page doesn't appear to be html."

Question

I am trying to access this website: https://www.apa.org/pubs/journals/browse?query=Title:*&type=journal

However, I get the error message: Current page doesn't appear to be html.

As a result, I cannot go on to scrape the website with html_nodes etc.

This is my code:

apa_url <- "https://www.apa.org/pubs/journals/browse?query=Title:*&type=journal"

apa_page <- rvest::html_session(apa_url,
                                httr::user_agent("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20"))

If you know how to fix it, I would be grateful for your help!

Answer 1

You haven't shared what you want to scrape, but you don't need to create a session.

For example, to get the journal titles on the first page, you can do:

library(rvest)
apa_url <- "https://www.apa.org/pubs/journals/browse?query=Title:*&type=journal"

apa_url %>%
  read_html() %>%
  html_nodes('section.sresults li a') %>%
  html_text()

# [1] "American Journal of Orthopsychiatry - APA Publishing | APA"               
# [2] "American Psychologist Journal - APA Publishing | APA"                     
# [3] "Archives of Scientific Psychology"                                     
# [4] "Asian American Journal of Psychology"                                     
# [5] "Behavior Analysis: Research and Practice"                          
# [6] "Behavioral Development"       
#...
#...
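
The same pipeline can be wrapped in a small helper and tried on a local HTML string first, which makes it easy to check the selector without requesting the site. This is only a sketch: `extract_titles` and the sample markup are made up for illustration, and the markup merely mimics the structure that the selector `section.sresults li a` implies, not the real page.

```r
library(rvest)

# Hypothetical helper (not part of rvest): return the text of every
# link matched by the CSS selector in an already parsed page.
extract_titles <- function(page, css = "section.sresults li a") {
  page %>%
    html_nodes(css) %>%
    html_text(trim = TRUE)  # trim = TRUE strips surrounding whitespace
}

# Made-up markup that only imitates the structure the selector expects
sample_html <- '<html><body><section class="sresults"><ul>
  <li><a href="/a">Journal A</a></li>
  <li><a href="/b">Journal B </a></li>
</ul></section></body></html>'

titles <- extract_titles(read_html(sample_html))
print(titles)
```

On the live page the call would be `extract_titles(read_html(apa_url))`, assuming the site still serves the same markup.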

Answer 1: accepted, 2021-01-04 09:05:02
