In R, what is the best way to write a loop within a loop for a character string?

I am looking into a web crawler that will go through multiple indeed.com country URLs.

I have the first part of the code, which crawls through the individual result pages, below:

library(tidyverse)  # also attaches dplyr and stringr
library(rvest)
library(xml2)
library(dplyr)
library(stringr)

# empty frame that each page's results get appended to
listings <- data.frame(title = character(),
                       company = character(),
                       stringsAsFactors = FALSE)

for (i in seq(0, 500, 10)) {
  url_ds <- paste0('https://www.indeed.com/jobs?q=data+analyst&l=&radius=25&start=', i)
  var <- read_html(url_ds)

  # job title: keep the text from the first word character to the end of that line
  title <- var %>%
    html_nodes('#resultsCol .jobtitle') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")

  # company
  company <- var %>%
    html_nodes('#resultsCol .company') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")

  listings <- rbind(listings, as.data.frame(cbind(company, title)))
}

What I would also like to do is loop through an array of the different country URLs at the beginning of url_ds above, using the url_basic_list below, and add a column for the actual country. Basically, I need to create a loop within a loop for a text string. What is the best way to do so?

url_basic_list<-
     c("http://www.indeed.com",
     "http://www.indeed.com.hk",
     "http://www.indeed.com.sg"
     )

country<-
     c("USA",
     "Hong Kong",
     "Singapore"
     )

Two suggestions:

  • Change your for loop to lapply. This is mostly because iteratively adding rows to a data.frame starts out okay but gets slower and more memory-intensive with each pass through the loop: for each rbind, R has to copy all of the contents in memory, so your memory needs are at least double the size of the frame. lapply instead builds a list of data.frames, which is created and filled as memory-efficiently as R can manage, and then a single rbind at the end combines the whole dataset.

  • Turn this into a function, and pass the country code (cc) as a function argument.

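get_indeed <- function(cc = "") {
  # the US site has no country suffix; other sites append ".<cc>" to the host
  dotcc <- if (cc %in% c("", "us")) "" else paste0(".", cc)

  # one data.frame per results page, collected in a list
  listings_list <- lapply(seq(0, 500, by = 10), function(i) {
    url_ds <- sprintf('https://www.indeed.com%s/jobs?q=data+analyst&l=&radius=25&start=%i', dotcc, i)
    var <- read_html(url_ds)

    # job title: keep the text from the first word character to the end of that line
    title <- var %>%
      html_nodes('#resultsCol .jobtitle') %>%
      html_text() %>%
      str_extract("(\\w+.+)+")

    # company
    company <- var %>%
      html_nodes('#resultsCol .company') %>%
      html_text() %>%
      str_extract("(\\w+.+)+")

    data.frame(company, title, stringsAsFactors = FALSE)
  })
  # a single rbind over all pages
  listings <- do.call(rbind, listings_list)
  # an empty code means the US site, so label it "us"
  listings$cc <- if (nzchar(cc)) cc else "us"
  listings
}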
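
To see the first suggestion in isolation, here is a minimal sketch that compares growing a frame inside a for loop with collecting the pieces in a list and binding once. It does not touch the network: fake_page is a hypothetical stand-in for scraping one results page, and the shortened offset range is only for illustration.

```r
# Hypothetical stand-in for scraping one results page: returns a small
# data.frame with made-up values, no network access involved.
fake_page <- function(i) {
  data.frame(
    company = paste0("company_", i, "_", 1:2),
    title   = paste0("title_", i, "_", 1:2),
    stringsAsFactors = FALSE
  )
}

starts <- seq(0, 20, by = 10)

# Pattern 1: grow the frame inside the loop (the frame is copied on every pass).
grown <- data.frame(company = character(), title = character(),
                    stringsAsFactors = FALSE)
for (i in starts) {
  grown <- rbind(grown, fake_page(i))
}

# Pattern 2: collect the pieces in a list, then bind once at the end.
pieces <- lapply(starts, fake_page)
bound  <- do.call(rbind, pieces)

# Compare the contents of the two results, ignoring attributes.
all.equal(grown, bound, check.attributes = FALSE)
```

Both patterns should end up with the same rows; the difference is how much copying happens along the way, which only becomes noticeable as the number of pages grows.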

With get_indeed defined, to "loop" through a series of countries, one might do:

all_countries <- lapply(c("us", "hk", "sg"), get_indeed)
all_countries <- do.call(rbind, all_countries)
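
The "loop within a loop" is then an outer lapply over country codes wrapped around the inner lapply over page offsets. As a rough sketch of that nesting, the snippet below only builds the URLs and does not fetch anything. build_url is a hypothetical helper that mirrors the URL pattern in get_indeed, and the offset range is shortened for illustration.

```r
# Hypothetical helper: build the search URL for one country code and offset.
build_url <- function(cc, start) {
  dotcc <- if (cc %in% c("", "us")) "" else paste0(".", cc)
  sprintf("https://www.indeed.com%s/jobs?q=data+analyst&l=&radius=25&start=%d",
          dotcc, as.integer(start))
}

codes  <- c("us", "hk", "sg")
starts <- seq(0, 20, by = 10)   # shortened range, for illustration only

# Outer lapply over countries, inner vapply over page offsets.
urls_by_country <- lapply(codes, function(cc) {
  data.frame(
    cc  = cc,
    url = vapply(starts, function(s) build_url(cc, s), character(1)),
    stringsAsFactors = FALSE
  )
})
all_urls <- do.call(rbind, urls_by_country)

# First URL for the US site and first URL for the Hong Kong site.
all_urls$url[c(1, 4)]
```

Printing the URLs like this before running the real scraper is a cheap way to check that every country and offset combination is what you expect.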

After this, all of your $cc values will be the two-letter codes, which is fine. To bring in the full names, I suggest a simple data.frame that maps one to the other:

countries <- data.frame(
  cc = c("us", "hk", "sg"),
  country = c("USA", "Hong Kong", "Singapore")
)
all_countries <- merge(all_countries, countries, by = "cc")

Your data will now have both $cc (the two-letter code) and $country (the full name).
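
For a quick check of how the lookup behaves, here is a small sketch on made-up data; the listings frame and its values are purely illustrative. Note that merge sorts the result by the key column by default, and that all.x = TRUE (an addition here, not part of the code above) keeps listings whose code has no match in the lookup table.

```r
# Toy stand-in for the scraped result; rows and values are made up.
listings_toy <- data.frame(
  cc      = c("us", "sg", "hk", "us"),
  company = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)
countries <- data.frame(
  cc      = c("us", "hk", "sg"),
  country = c("USA", "Hong Kong", "Singapore"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps listings with an unmatched code (country becomes NA).
merged <- merge(listings_toy, countries, by = "cc", all.x = TRUE)
merged
```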
