
Quickly Categorizing Character Vector in R

I have a dataset with a column of messy character data. I'd like to convert it to categorical (factor) data for analysis.

carData <- data.frame(car=c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu"), 
                 year=c("2001", "1994", "2004", "1980", "2000"))

            car year
1       Mustang 2001
2 Toyota Tercel 1994
3            M3 2004
4   Datsun 240Z 1980
5  Chevy Malibu 2000

I've created a couple of vectors to help with this: one holding the search strings and another holding the associated categories.

cars <- c("Mustang", "Toyota", "M3", "Chevy")
make <- c("Ford", "Toyota", "BMW", "Chevrolet")

My intent is to loop over the list and assign the category in a new variable.

categorize <- function(df, searchString, category) {
  df$make <- "OTHER"
  for(i in seq(1, length(searchString), 1)) {
    list <- grep(searchString[i], df[,1], ignore.case=TRUE)
    if (length(list) > 0) {
      for(j in seq(1, length(list), 1)) {
        df$make[list[j]] <- category[i]
      }
    }
  }
  df
}

cleanCarData <- categorize(carData, cars, make)

Output is:

            car year      make
1       Mustang 2001      Ford
2 Toyota Tercel 1994    Toyota
3            M3 2004       BMW
4   Datsun 240Z 1980     OTHER
5  Chevy Malibu 2000 Chevrolet

My code works, but my data has ~1M rows and the function takes ~3 hours to complete. By contrast, if I write a separate one-line statement for each search string, all of them together finish in less than 3 minutes.

df$make <- "OTHER"
df$make[grep("Mustang", df$car, ignore.case=TRUE)] <- "Ford"
df$make[grep...]

I have 50 search strings so far and could easily have 100 more as I work my way through the data. Is there a good compromise between maintainable code and performance?

You can make things better by eliminating the inner loop and assigning to all matching rows at once:

categorize <- function(df, searchString, category) {
  df$make <- "OTHER"
  for(i in seq_along(searchString)) {
    list <- grep(searchString[i], df[,1], ignore.case=TRUE)
    if (length(list) > 0) {
      df$make[list] <- category[i]
    }
  }
  df
}

It is hard to test at scale whether that's where most of your time is spent, because your sample data isn't very large.
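One way to get a rough timing is to inflate the sample data by resampling its rows and wrap the call in system.time(). The following is only a sketch: the row count n and the name bigCarData are assumptions, the index variable is renamed from list to hits to avoid shadowing base::list, and resampled data has only five distinct values, so the numbers will not match real data exactly.

```r
# Sample data and lookup vectors from the question
carData <- data.frame(car = c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu"),
                      year = c("2001", "1994", "2004", "1980", "2000"),
                      stringsAsFactors = FALSE)
cars <- c("Mustang", "Toyota", "M3", "Chevy")
make <- c("Ford", "Toyota", "BMW", "Chevrolet")

# Single-loop version: one vectorised assignment per search string
categorize <- function(df, searchString, category) {
  df$make <- "OTHER"
  for (i in seq_along(searchString)) {
    hits <- grep(searchString[i], df[, 1], ignore.case = TRUE)
    df$make[hits] <- category[i]  # a zero-length index is a harmless no-op
  }
  df
}

# Hypothetical scale-up: resample the five rows up to n rows (n is an assumed size)
set.seed(1)
n <- 100000
bigCarData <- carData[sample(nrow(carData), n, replace = TRUE), ]
rownames(bigCarData) <- NULL

# Time the categorization; raise n towards 1e6 to approximate the real workload
timing <- system.time(bigResult <- categorize(bigCarData, cars, make))
print(timing)
```

Running the same timing against the original nested-loop function should show whether the row-by-row assignment is the bottleneck.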

This is a possibility:

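cleanCarData = carData
# One vectorised assignment per search string (note: grep is case-sensitive here)
for(k in seq_along(cars)) {
    sel=grep(cars[k], as.character(cleanCarData$car))
    cleanCarData[sel,"make"] = make[k]
}
# Rows that matched no search string are still NA; label them "OTHER"
cleanCarData$make[is.na(cleanCarData$make)] = "OTHER"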
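On the maintainability side, with 50 to 150 search strings it may help to keep each pattern and its category together in one table, so that adding a pattern means adding a row. This is a minimal sketch, not something from the answers above: the names lookup and categorizeLookup are hypothetical, and it assumes later rows should override earlier ones, as in the loops shown.

```r
carData <- data.frame(car = c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu"),
                      year = c("2001", "1994", "2004", "1980", "2000"),
                      stringsAsFactors = FALSE)

# Hypothetical lookup table: one row per search string and its category
lookup <- data.frame(pattern = c("Mustang", "Toyota", "M3", "Chevy"),
                     make = c("Ford", "Toyota", "BMW", "Chevrolet"),
                     stringsAsFactors = FALSE)

# Returns a character vector of categories, one per element of x
categorizeLookup <- function(x, lookup, default = "OTHER") {
  x <- as.character(x)  # guard against factor input
  out <- rep(default, length(x))
  for (i in seq_len(nrow(lookup))) {
    # One vectorised grepl() call per pattern; later rows override earlier ones
    out[grepl(lookup$pattern[i], x, ignore.case = TRUE)] <- lookup$make[i]
  }
  out
}

cleanCarData <- carData
cleanCarData$make <- factor(categorizeLookup(cleanCarData$car, lookup))
print(cleanCarData)
```

The table could equally be read from a CSV file, which keeps the pattern list out of the code entirely.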

