R: How to use map(read_html) for more than one URL?

I'm trying to scrape the title, date and content from multiple article URLs. The code below works for a single URL string, but I keep getting errors when I try to pass more than one string.

Working code:

library(rvest)  # read_html(), html_node(), html_nodes(), html_text()
library(purrr)  # map(), map_chr()

# single_url is a character string holding one article URL
article <- single_url %>% purrr::map(read_html)
title <-
  article %>% map_chr(. %>% html_node("title") %>% html_text())
content <-
  article %>% map_chr(. %>% html_nodes("p") %>% html_text() %>% paste(., collapse = ""))
time <-
  article %>% map_chr(. %>% html_nodes("time") %>% html_text() %>% paste(., collapse = ""))
article_table <- data.frame("Title" = title, "Content" = content, "New Time" = time)

I've had no success modifying it to pass each string in my vector variable 'Direct.Link' (see example data below).

Example Data:

test <- structure(list(Participant.Name = c("Participant 1", "Participant 1", 
                                    "Participant 2"), Direct.Link = c("https://chicago.suntimes.com/2020/3/26/21196297/jails-and-prisons-could-become-coronavirus-disaster", 
                                                                     "https://www.pressconnects.com/story/news/local/2017/07/28/cornell-study-sheds-light-students-incarcerated-parents/512160001/", 
                                                                     "https://www.newsobserver.com/news/local/article247133959.html"
                                    )), row.names = c(9L, 12L, 33L), class = "data.frame")

I've tried this, to no avail.

Attempt at modifying the code:

article <- test %>% mutate( art = purrr::map(read_html(Direct.Link)))
#Error: Problem with `mutate()` column `art`.
#i `art = purrr::map(read_html(Direct.Link))`.
#x argument ".f" is missing, with no default

Ultimately I'd like to get a dataset that looks like this:

Ideal Data:

test2 <- structure(list(Participant.Name = c("Participant 1", "Participant 1", 
                                            "Participant 2"), Direct.Link = c("https://chicago.suntimes.com/2020/3/26/21196297/jails-and-prisons-could-become-coronavirus-disaster", 
                                                                              "https://www.pressconnects.com/story/news/local/2017/07/28/cornell-study-sheds-light-students-incarcerated-parents/512160001/", 
                                                                              "https://www.newsobserver.com/news/local/article247133959.html"),
                        title= c("Title of News Article 1", "Title of News Article 2", "Title of News Article 3"), 
                        content = c("Content of News Article 1", "Content of News Article 2", "Content of News Article 3" ),
                        time = c("Date of News Article 1","Date of News Article 2","Date of News Article 3"
                        )), row.names = c(9L, 12L, 33L), class = "data.frame")

Thanks for any help offered!

It looks like map doesn't like the way you're providing the function to it. In your attempt, the whole expression read_html(Direct.Link) is passed as map's first argument (the data, .x), so map never receives a function for .f, which is what the error message is telling you. map is really only designed to apply a single function at a time, so you could either define a new function (e.g. getTitle) that performs the three steps of getting the article, getting the node, and then getting the title, or you can break it up into multiple calls to mutate. The piping system you've got there doesn't really work neatly with purrr's intended use.
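To make the calling convention concrete, here is a minimal sketch of passing the data and the function to map as separate arguments. It assumes two hypothetical literal HTML strings in place of the real URLs (read_html also accepts literal HTML), so it runs without network access; the names pages, docs and titles are illustrative only.

```r
library(purrr)
library(rvest)

# Hypothetical stand-ins for the URLs in Direct.Link: read_html() also accepts
# literal HTML strings, so this sketch needs no network access.
pages <- c(
  "<html><head><title>First article</title></head><body><p>One.</p></body></html>",
  "<html><head><title>Second article</title></head><body><p>Two.</p></body></html>"
)

# The vector goes in as .x and the function itself (no parentheses) as .f;
# map() then calls read_html() once per element and returns a list.
docs <- map(pages, read_html)

# Extra arguments placed after .f are forwarded to it on every call,
# so this is html_node(doc, "title") for each parsed document.
titles <- docs %>%
  map(html_node, "title") %>%
  map_chr(html_text)

print(titles)
```

Swapping pages for test$Direct.Link should give the same shape of result, one title per URL, provided each page can actually be fetched.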

Here's an example of the split call to mutate. Note that each time I call map, I'm only providing a single function rather than a chain:

articles <- test %>% 
  slice(1:2) %>% #The third article was causing rvest to hang
  mutate(art = map(.x = Direct.Link, .f = read_html)) %>%
  mutate(title = map(art, html_node, "title")) %>% 
  mutate(title = map_chr(title, html_text)) %>%
  select(-art) # Drop the article external pointer so it can be printed neatly

And here's an example of the custom function:

getTitle <- function(art_url){
  art_url %>% read_html() %>% html_node("title") %>% html_text()
}
articles <- test %>% 
  slice(1:2) %>%
  mutate(title = map_chr(Direct.Link, getTitle)) # map_chr returns a character column rather than a list column

Both of the above return the same result: the first two rows of test with an added title column.

This can then be expanded/repeated to extract the other nodes of interest. Note that the first approach is much more polite web behavior because you're only requesting each article once, rather than once per extracted field, but I thought it would still be helpful to illustrate how map works with a custom function. I'll also give a shoutout to the polite package since you're web scraping, potentially repeatedly.
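As a rough sketch of that expansion, the helper below parses each page once and pulls out the title, content and time together, then attaches them to the original rows. The function name scrape_article, the demo data frame and its literal HTML strings are all assumptions made for illustration; they stand in for the real test data so the example runs offline.

```r
library(dplyr)
library(purrr)
library(rvest)

# Hypothetical helper (not part of the original answer): parse a page once and
# return its title, content and time as a one-row tibble.
scrape_article <- function(src) {
  doc <- read_html(src)
  # Collapse the text of every node matching a CSS selector into one string.
  collapse_text <- function(css) {
    doc %>% html_nodes(css) %>% html_text() %>% paste(collapse = "")
  }
  tibble(
    title   = doc %>% html_node("title") %>% html_text(),
    content = collapse_text("p"),
    time    = collapse_text("time")
  )
}

# Stand-in data: literal HTML strings take the place of real URLs.
demo <- data.frame(
  Participant.Name = c("Participant 1", "Participant 2"),
  Direct.Link = c(
    "<html><head><title>A</title></head><body><time>2020-03-26</time><p>Text A</p></body></html>",
    "<html><head><title>B</title></head><body><time>2017-07-28</time><p>Text B</p></body></html>"
  ),
  stringsAsFactors = FALSE
)

# One parse per row; stack the one-row tibbles and bind them to the input.
scraped <- map(demo$Direct.Link, scrape_article) %>% bind_rows()
result <- bind_cols(demo, scraped)

print(result[, c("Participant.Name", "title", "content", "time")])
```

Because every field comes from the same parsed document, each URL is requested only once, which keeps the politeness benefit of the first approach while still using a single custom function.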
