[英]Subsetting elements in a list and placing them in a data frame
我有一个看起来像这样的列表(“listanswer”):
> str(listanswer)
List of 100
$ : chr [1:3] "" "" "\t\t"
$ : chr [1:5] "" "Dr. Smith" "123 Fake Street" "New York, ZIPCODE 1" ...
$ : chr [1:5] "" "Dr. Jones" "124 Fake Street" "New York, ZIPCODE 2" ...
> listanswer
[[1]]
[1] "" "" "\t\t"
[[2]]
[1] "" "Dr. Smith" "123 Fake Street" "New York"
[5] "ZIPCODE 1"
[[3]]
[1] "" "Dr. Jones" "124 Fake Street," "New York"
[5] "ZIPCODE2"
对于此列表中的每个元素,我注意到子元素中的以下模式:
# first sub-element is always empty
> listanswer[[2]][[1]]
[1] ""
# second sub-element is the name
> listanswer[[2]][[2]]
[1] "Dr. Smith"
# third sub-element is always the address
> listanswer[[2]][[3]]
[1] "123 Fake Street"
# fourth sub-element is always the city
> listanswer[[2]][[4]]
[1] "New York"
# fifth sub-element is always the ZIP
> listanswer[[2]][[5]]
[1] "ZIPCODE 1"
我想创建一个数据框,其中包含此列表中的行格式信息。 例如:
id name address city ZIP
1 2 Dr. Smith 123 Fake Street New York ZIPCODE 1
2 3 Dr. Jones 124 Fake Street New York ZIPCODE 2
我想到了以下方法来做到这一点:
name = sapply(listanswer,function(x) x[2])
address = sapply(listanswer,function(x) x[3])
city = sapply(listanswer,function(x) x[4])
zip = sapply(listanswer,function(x) x[5])
final_data = data.frame(name, address, city, zip)
id = 1:nrow(final_data)
我的问题:我只是想确认一下 - 这是在列表中引用子元素的正确方法吗?
如果它有效,这是正确的方法,尽管可能有更有效或更易读的方法来做同样的事情。
另一种方法是使用您的列创建一个数据框,并向其中添加行。 IE
#create an empty data frame
df <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(df) <- c("name", "address", "city", "zip")
#add rows
lapply(listanswer, \(x){df[nrow(df) + 1,] <- x[2:5]})
这只是解决同一问题的另一种方法。 可读性是个人喜好,您的解决方案也没有错。
如果这是基于您的大象问题,对于温哥华的企业,那么这主要是可行的。
library(rvest)
url<-"Website/british-columbia/"
page <-read_html(url)
#find the div tab of class=one_third
b = page %>% html_nodes("div.one_third")
listanswer <- b %>% html_text() %>% strsplit("\\n")
#listanswer2 <- b %>% html_text2() %>% strsplit("\\n")
listanswer[[1]]<-NULL #remove first blank record
rows<-lapply(listanswer, function(element){
vect<-element[-1] #remove first blank field
cityindex<-as.integer(grep("Vancouver", vect)) #find city field
#add some error checking and corrections
if(length(cityindex)==0) {
cityindex <- length(vect)-1 }
else if(length(cityindex)>1) {
cityindex <- cityindex[2] }
#get the fields of interest
address <- vect[cityindex-1]
city<-vect[cityindex]
phone <- vect[cityindex+1]
if( cityindex < 3) {
cityindex <- 3
} #error check
#first groups combine into 1 name
name <- toString(vect[1:(cityindex-2)])
data.frame(name, address, city, phone)
})
answer<-bind_rows(rows)
#clean up
answer$phone <- sub("Website", "", answer$phone)
answer
这仍然需要一些清理来处理不一致,但应该完成 80-90%
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.