Error: invalid subscript type 'list' (Webscraping)

I am trying to scrape data from the following URL: https://university.careers360.com/colleges/list-of-degree-colleges-in-India I want to follow the link behind each college name and collect specific data for each college.

First, I collected all the college names and URLs:

#loading the package:
library(xml2)
library(rvest)
library(stringr)
library(dplyr)

#Specifying the url of the website to be scraped
baseurl <- "https://university.careers360.com/colleges/list-of-degree-colleges-in-India"

#Reading the html content of the listing page
basewebpage <- read_html(baseurl)

#Extracting college name and its url
scraplinks <- function(url){
   #Create an html document from the url
   webpage <- xml2::read_html(url)
   #Extract the URLs
   url_ <- webpage %>%
   rvest::html_nodes(".title a") %>%
   rvest::html_attr("href")  
   #Extract the link text
   link_ <- webpage %>%
   rvest::html_nodes(".title a") %>%
   rvest::html_text()
   #tibble() replaces the deprecated data_frame()
   return(dplyr::tibble(link = link_, url = url_))
}

#College names and Urls
allcollegeurls<-scraplinks(baseurl)

#Reading each college url
library(purrr)
allreadurls<-map(allcollegeurls$url, read_html)

This works fine up to this point, but the following code raises an error:

#Specialization
#Using CSS selectors to scrape the specialization section
allcollegeurls$Specialization<-NA
for (i in allreadurls) {
  allcollegeurls$Specialization[i] <- html_nodes(allreadurls[i][],'td:nth- 
  child(1)')
}

Error in allreadurls[i] : invalid subscript type 'list'

I'm not sure about the scraped content itself, but you may want to replace the loop with:

for (i in seq_along(allreadurls)) {
  allcollegeurls$Specialization[i] <- html_nodes(allreadurls[[i]], 'td:nth-child(1)')
}

One problem with your approach was the inconsistent role of i: it took its values from the elements of allreadurls, but was then used as an index to subset both Specialization and allreadurls. Another problem was the stray line break and extra spaces in the selector:

'td:nth- 
  child(1)'
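The index problem can be reproduced without any scraping. The following is a minimal sketch in base R, assuming a hypothetical list `pages` whose elements are themselves lists, standing in for `allreadurls` (parsed documents are list-based objects, which is why the error message mentions 'list'):

```r
# Hypothetical stand-in for allreadurls: a list whose elements are lists
pages <- list(list(title = "College A"), list(title = "College B"))

# Iterating over the list yields its elements, not their positions
for (i in pages) {
  print(class(i))
}

# Using an element as a subscript reproduces the error from the question;
# tryCatch captures the message instead of stopping the script
msg <- tryCatch(pages[pages[[1]]], error = function(e) conditionMessage(e))
print(msg)

# Iterating over positions gives integers, which are valid subscripts
titles <- character(length(pages))
for (i in seq_along(pages)) {
  titles[i] <- pages[[i]]$title
}
print(titles)
```

`seq_along(pages)` is used here rather than `1:length(pages)` because it also behaves sensibly when the list is empty.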

Finally, since allreadurls is a list, you want to subset it with [[i]], not [i] (which returns a list again). There is also no need for the trailing [].
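To illustrate the difference between [i] and [[i]], here is a small sketch that runs offline. The two inline HTML strings and the names `html_a`, `html_b`, `pages` and `specialization` are made up for illustration and are not taken from the real site. The sketch also assumes that one string per college is wanted, so it extracts the cell text with html_text() and collapses it rather than storing the node set itself:

```r
library(xml2)
library(rvest)

# Hypothetical inline pages standing in for the downloaded college pages
html_a <- "<table><tr><td>Arts</td><td>3 years</td></tr><tr><td>Science</td><td>3 years</td></tr></table>"
html_b <- "<table><tr><td>Commerce</td><td>3 years</td></tr></table>"
pages <- list(read_html(html_a), read_html(html_b))

# [ keeps the list wrapper; [[ returns the parsed document itself
print(class(pages[1]))
print(class(pages[[1]]))

# Collect the first-column text of each page into one string per page
specialization <- character(length(pages))
for (i in seq_along(pages)) {
  cells <- html_nodes(pages[[i]], "td:nth-child(1)")
  specialization[i] <- paste(html_text(cells), collapse = "; ")
}
print(specialization)
```

Whether 'td:nth-child(1)' is the right selector for the specialization section of the real pages is not something this sketch can confirm; it only shows the subsetting and the assignment of one value per row.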
