How to use a for loop for web scraping in R

I have a df with two columns: id and url. id contains project ids, and url contains website links which I would like to use for scraping the ids of parent projects. Here is a sample df:

df <- structure(list(id = c("P173165", "P175875", "P175841", "P175730"
), url = c("https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en", 
"https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175875&apilang=en", 
"https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175841&apilang=en", 
"https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175730&apilang=en"
)), row.names = c(NA, -4L), class = c("data.table", "data.frame"))

> df
        id                                                                                 url
1: P173165 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en
2: P175875 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175875&apilang=en
3: P175841 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175841&apilang=en
4: P175730 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175730&apilang=en

@Sirius suggested that I can scrape parent project ids with the following code:

library(jsonlite)

#let's do an example for row 1

json_data <- fromJSON("https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en")
json_data$projects[["P173165"]]$parentprojid

As you can see, I pass in the url from the first row and then use the id from the first row to index the result. This code outputs a parent project id:

[1] "P147665"

I want to write code that automates this process and creates a vector containing the parent projects' ids. I would then assign this vector as a third column of my df. This is what I want to achieve:

        id                                                                                 url par_proj_id
1: P173165 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en     P147665
2: P175875 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175875&apilang=en     P173883
3: P175841 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175841&apilang=en     P170267
4: P175730 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175730&apilang=en     P173799

I guess I should be using a for loop here, but I'm not sure how. Any ideas? I'd appreciate any help.

You can put the request into a function and then use purrr's map2 family to pass in the child id and url for each row. This is more concise and more idiomatic R than an explicit for loop; the running time is dominated by the HTTP requests either way.

library(jsonlite)
library(purrr)

# Fetch the API response for one project and return its parent project id.
get_parent_id <- function(child_id, url){
  json_data <- jsonlite::fromJSON(url)
  return(json_data$projects[[child_id]]$parentprojid)
}
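
If some projects have no parent, or a request fails, the function above returns NULL or stops with an error. A minimal sketch of a more defensive variant follows. The name get_parent_id_safe and the fetch parameter are hypothetical (they are not part of the original answer); fetch exists only so a stub can stand in for jsonlite::fromJSON when trying the function offline.

```r
# Hypothetical defensive variant of get_parent_id.
# `fetch` is an assumed parameter: any function that maps a URL to the
# parsed response. The default needs the jsonlite package, but it is only
# evaluated when no other `fetch` is supplied.
get_parent_id_safe <- function(child_id, url, fetch = jsonlite::fromJSON) {
  parent <- tryCatch(
    fetch(url)$projects[[child_id]]$parentprojid,
    error = function(e) NULL  # a failed request or parse counts as "not found"
  )
  # Return NA instead of NULL so the result always has length one.
  if (is.null(parent) || length(parent) == 0) {
    NA_character_
  } else {
    as.character(parent[[1]])
  }
}

# Offline stubs that mimic the shape of the response used in the question.
ok_fetch <- function(url) {
  list(projects = list(P173165 = list(parentprojid = "P147665")))
}
failing_fetch <- function(url) stop("network unavailable")

found   <- get_parent_id_safe("P173165", "unused-url", ok_fetch)
missing <- get_parent_id_safe("P000000", "unused-url", ok_fetch)  # made-up id
failed  <- get_parent_id_safe("P173165", "unused-url", failing_fetch)
print(c(found, missing, failed))
```

Because every call returns exactly one string or NA, this variant could be passed to purrr::map2_chr in place of get_parent_id.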


df <- structure(list(id = c("P173165", "P175875", "P175841", "P175730"
), url = c("https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en", 
           "https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175875&apilang=en", 
           "https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175841&apilang=en", 
           "https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175730&apilang=en"
)), row.names = c(NA, -4L), class = c("data.table", "data.frame"))


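# map2_chr returns a character vector; plain map2 would create a list column.
# It raises an error if any project has no parentprojid.
df$par_proj_id <- purrr::map2_chr(df$id, df$url, get_parent_id)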
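
Since the question asks specifically about a for loop, here is a minimal sketch of the same idea written as one. The helper add_parent_ids and its lookup parameter are hypothetical names; lookup is assumed to be any function with the same (child_id, url) signature as get_parent_id. The demonstration uses a stub lookup built from the ids shown in the question, so it runs without network access.

```r
# Hypothetical helper: fill a par_proj_id column with an explicit for loop.
# `lookup` is assumed to take (child_id, url) and return one string.
add_parent_ids <- function(df, lookup) {
  parent_ids <- character(nrow(df))  # preallocate the result vector
  for (i in seq_len(nrow(df))) {
    parent_ids[i] <- lookup(df$id[i], df$url[i])
  }
  df$par_proj_id <- parent_ids
  df
}

# Offline stub standing in for get_parent_id; values are taken from the
# expected output in the question.
known_parents <- c(P173165 = "P147665", P175875 = "P173883")
stub_lookup <- function(child_id, url) unname(known_parents[child_id])

demo <- data.frame(
  id  = c("P173165", "P175875"),
  url = c("https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en",
          "https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175875&apilang=en"),
  stringsAsFactors = FALSE
)

result <- add_parent_ids(demo, stub_lookup)
print(result$par_proj_id)
```

With the real API, the call would be df <- add_parent_ids(df, get_parent_id). Preallocating parent_ids avoids growing the vector on every iteration.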
