
How do I add a loop when using R to scrape data?

I'm trying to build a database of crime data by zip code from Trulia.com. The code below runs, but so far it only produces one line of data. Zipcodes is just a list of US zip codes. Can anyone tell me what I need to add so that the loop runs through every zip code in my list and keeps the result for each one?

Here is a link to one of the Trulia pages for reference: https://www.trulia.com/real_estate/20004-Washington/crime/

UPDATE: Here are zip codes for download: https://www.dropbox.com/s/uxukqpu0v88d7tf/Zip%20Code%20Database%20wo%20Boston.xlsx?dl=0

I also changed the code a bit after realizing that the crime stats appear in a different order depending on the zip code. Is it possible to have the loop produce 4 lines per zip code? The code currently runs, but the result only contains the last zip code in the dataset. I can't figure out how to record each zip code's data on separate lines, so that each iteration doesn't overwrite the previous one and leave only the last zip code.

Please help!!

library(rvest)

data <- data.frame(Zipcodes)
for (i in data$Zip.Code) {
  site <- paste("https://www.trulia.com/real_estate/", i, "-Boston/crime/", sep = "")
  site <- html(site)

  crime <- data.frame(zip = i,
                      type = site %>% html_nodes(".brs") %>% html_text(),
                      stringsAsFactors = FALSE)
}
View(crime)

If that code doesn't work, try this:

data <- data.frame(Zillow_Data_for_R_Test)
for (i in data$Zip.Code) {
  site <- paste("https://www.trulia.com/real_estate/", i, "-Boston/crime/", sep = "")
  site <- read_html(site)
  crime <- data.frame(zip = i,
                      theft = site %>% html_nodes(".crime-text-0") %>% html_text(),
                      assault = site %>% html_nodes(".crime-text-1") %>% html_text(),
                      arrest = site %>% html_nodes(".crime-text-2") %>% html_text(),
                      vandalism = site %>% html_nodes(".crime-text-3") %>% html_text(),
                      robbery = site %>% html_nodes(".crime-text-4") %>% html_text(),
                      type = site %>% html_nodes(".clearfix") %>% html_text(),
                      stringsAsFactors = FALSE)
}
View(crime)

The comment by @r2evans already provides an answer. Since @ShanCham asked how to actually implement it, here is the code. It is just a more verbose version of the comment and was too long to post as another comment.

library(rvest)

# only two example zip codes; more could be added, of course
zipcodes <- c("02110", "02125")

crime <- lapply(zipcodes, function(z) {

  site <- read_html(paste0("https://www.trulia.com/real_estate/", z, "-Boston/crime/"))

  # for illustrative purposes:
  # - added as.numeric() to the numeric columns
  # - excluded some of your other columns and shortened the text in type
  data.frame(zip = z,
             theft = site %>% html_nodes(".crime-text-0") %>% html_text() %>% as.numeric(),
             assault = site %>% html_nodes(".crime-text-1") %>% html_text() %>% as.numeric(),
             type = site %>% html_nodes(".clearfix") %>% html_text() %>% paste(collapse = " ") %>% substr(1, 50),
             stringsAsFactors = FALSE)
})

class(crime)
#list

# The output is a list of data.frames that can be bound together into one data.frame
crime <- do.call(rbind, crime)

#crime is a data.frame, hence, classes/types are kept
class(crime$type)
# [1] "character"
class(crime$assault)
# [1] "numeric"
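
If you would rather keep the for loop from the question, the same idea applies: store each iteration's result in its own list slot instead of assigning to a single crime variable, then bind everything at the end. Below is a minimal sketch of that pattern. Note that fetch_crime() is a hypothetical stand-in I made up so the example runs without hitting Trulia, and the four crime types are invented placeholder values, not real scraped data.

```r
# Hypothetical stand-in for the scraping step, so the sketch runs offline.
# In real use, the body would call read_html() and html_nodes() as shown above.
fetch_crime <- function(zip) {
  data.frame(zip = zip,
             type = c("Theft", "Assault", "Arrest", "Vandalism"),  # placeholder values
             stringsAsFactors = FALSE)
}

zipcodes <- c("02110", "02125")

# Pre-allocate one list slot per zip code
results <- vector("list", length(zipcodes))

for (k in seq_along(zipcodes)) {
  # Assign into slot k rather than overwriting one variable on every pass
  results[[k]] <- fetch_crime(zipcodes[k])
}

# Stack the per-zip data.frames into a single data.frame
crime <- do.call(rbind, results)
```

Because data.frame() recycles the single zip value across the four type values, each zip code contributes four rows here, which is the "4 lines per zip code" layout asked about in the question.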


 