
What should the "i"/category be in this for loop and how can I ensure it is in my working directory?

I am running a web-scraping project and am having difficulty using the URLs from an initial scrape of search results to scrape information from the result pages themselves.

My first loop provides the back halves of the URLs I need, the part after the / (for example, for yelp.com/abd I have abd), which I store in a nested list. However, when I summarize that nested list, like so:

profile_url_lst <- list()
for(page_num in 1:73){
  main_url <- paste0("https://www.theeroticreview.com/reviews/newreviewsList.asp?searchreview=1&gCity=region1%2Dus%2Drhode%2Disland&gCityName=Rhode+Island+%28State%29&SortBy=3&gDistance=0&page=", page_num)
  html_content <- read_html(main_url)
  profile_urls <- html_content %>% html_nodes("body")%>% html_children() %>% html_children() %>% .[2] %>% html_children() %>% 
    html_children() %>% .[3] %>% html_children() %>% .[4] %>% html_children() %>% html_children() %>% html_children() %>% 
    html_attr("href")
  
  profile_url_lst[[page_num]] <- profile_urls
Sys.sleep(2)
}
profile_url_lst
profiles <- cbind(profile_urls)
profiles

I only receive the URLs from the last page of results.

I pasted the domain name onto those URLs with paste0, which worked fine, but I then encountered another problem. When I use the variable name in a for loop, R returns "variable name is not in your working directory".

complete_urls <- paste0('https://www.theeroticreview.com', profiles)
complete <- cbind(complete_urls)
complete
TED_lst <- list()
for(complete_urls in 1:73) {
  html_content1 <- read_html('complete_urls')
  TED <- html_content1 %>% html_nodes("'") %>% html_text()
  TED_lst[i] <- TEDs
Sys.sleep(2)

How do I paste the domain name onto all the collected URLs and bind them, and what should the category be in the for loop?

Assuming you intend to call read_html on each URL in complete_urls, avoid using complete_urls as the loop variable, because that overwrites the vector, and avoid quoting it, because 'complete_urls' is then a string literal rather than the variable. You could instead loop over seq_along(complete_urls) and index into the vector. Here I print rather than call read_html:

complete_urls <- c('A', 'B')

for(i in seq_along(complete_urls)){
  print(complete_urls[[i]])
}
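
The same indexing idea extends to the first part of the question. A likely reason only the last page shows up is that cbind(profile_urls) uses the variable left over from the final loop iteration rather than the accumulated list. The following is a minimal sketch, assuming the nested list holds one character vector of paths per page; the paths and the example.com domain are placeholders, not values from the original scrape.

```r
# Stand-in for the nested list built by the first loop
# (one character vector of URL paths per results page; values are made up).
profile_url_lst <- list(
  c("/reviews/a", "/reviews/b"),
  c("/reviews/c")
)

# Flatten the nested list so that every page contributes its paths.
profiles <- unlist(profile_url_lst)

# paste0() is vectorised: the domain is prepended to each path in one call.
# "https://example.com" is a placeholder domain for illustration.
base_url <- "https://example.com"
complete_urls <- paste0(base_url, profiles)

# Loop by position, as in the answer above, and index into the vector.
for (i in seq_along(complete_urls)) {
  print(complete_urls[[i]])
}
```

Because complete_urls is already a plain character vector here, no cbind step is needed before looping over it.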

It is probably better to write a custom function to apply to each URL and pass it to a tidyverse function, or possibly to something that lets you take advantage of parallel or async execution.
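
One way the custom-function approach might look is sketched below. The helper name scrape_one and its fetch argument are hypothetical, and the default fetch is an offline stand-in so the example runs without network access. Base R's lapply is used to keep the sketch dependency-free; purrr::map would be the tidyverse equivalent.

```r
# Hypothetical helper: process one URL and return its result.
# `fetch` is a stand-in for the real scraping step. With rvest loaded it could
# be something like
#   function(u) html_text(html_nodes(read_html(u), "a"))
# where the "a" selector is an assumption, not taken from the question.
scrape_one <- function(url, fetch = function(u) paste("content of", u)) {
  tryCatch(
    fetch(url),
    # Return NA instead of aborting the whole run if one page fails.
    error = function(e) NA_character_
  )
}

# Placeholder URLs for illustration.
complete_urls <- c("https://example.com/a", "https://example.com/b")

# Apply the helper to every URL; the result is a list with one element per URL.
results <- lapply(complete_urls, scrape_one)
names(results) <- complete_urls

# A failing fetch yields NA rather than an error.
failed <- scrape_one("https://example.com/c", fetch = function(u) stop("boom"))
```

Keeping the per-URL logic in one function also makes it easier to add a delay such as Sys.sleep between requests, or to swap lapply for a parallel mapper later.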
