I have a dataset with a column of messy character data. I'd like to convert it to a factor (categorical data) for analysis.
carData <- data.frame(car=c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu"),
year=c("2001", "1994", "2004", "1980", "2000"))
            car year
1       Mustang 2001
2 Toyota Tercel 1994
3            M3 2004
4   Datsun 240Z 1980
5  Chevy Malibu 2000
I've created two vectors to help with this: one of search strings, and another of the associated categories.
cars <- c("Mustang", "Toyota", "M3", "Chevy")
make <- c("Ford", "Toyota", "BMW", "Chevrolet")
My intent is to loop over the list and assign the category in a new variable.
categorize <- function(df, searchString, category) {
  df$make <- "OTHER"
  for(i in seq(1, length(searchString), 1)) {
    list <- grep(searchString[i], df[,1], ignore.case=TRUE)
    if (length(list) > 0) {
      for(j in seq(1, length(list), 1)) {
        df$make[list[j]] <- category[i]
      }
    }
  }
  df
}
cleanCarData <- categorize(carData, cars, make)
Output is:
            car year      make
1       Mustang 2001      Ford
2 Toyota Tercel 1994    Toyota
3            M3 2004       BMW
4   Datsun 240Z 1980     OTHER
5  Chevy Malibu 2000 Chevrolet
My code works, but my data has ~1M rows and it takes ~3 hours to complete. By contrast, if I write a separate one-line statement for each search string, all of them complete in less than 3 minutes.
df$make <- "OTHER"
df$make[grep("Mustang", df$car, ignore.case=TRUE)] <- "Ford"
df$make[grep...]
I have 50 search strings so far and could easily have 100 more as I work my way through the data. Is there a good compromise between maintainable code and performance?
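You can make things better by eliminating the inner loop. Indexed assignment accepts a whole vector of row indices, so all matching rows can be updated in one step: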
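categorize <- function(df, searchString, category) {
  df$make <- "OTHER"  # default for rows that match no search string
  for(i in seq_along(searchString)) {
    # row indices whose first column matches the i-th search string
    hits <- grep(searchString[i], df[,1], ignore.case=TRUE)
    if (length(hits) > 0) {
      df$make[hits] <- category[i]  # assign all matching rows at once
    }
  }
  df
}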
This is hard to test at scale to see whether that's where most of your time is spent, because your sample data isn't very large.
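One way to check this at scale is to time the function on synthetic data. The following is only a sketch: `bigData`, `models`, and the size `n` are made-up stand-ins for the real dataset, and the timings will depend on the machine and on how varied the real strings are.

```r
# Sketch only: synthetic data standing in for the real ~1M-row dataset.
set.seed(1)
n <- 1e5  # assumed size; raise toward 1e6 to approximate the real data
models <- c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu")
bigData <- data.frame(car = sample(models, n, replace = TRUE),
                      stringsAsFactors = FALSE)

cars <- c("Mustang", "Toyota", "M3", "Chevy")
make <- c("Ford", "Toyota", "BMW", "Chevrolet")

# Single-loop version: one vectorized assignment per search string.
categorize <- function(df, searchString, category) {
  df$make <- "OTHER"
  for (i in seq_along(searchString)) {
    hits <- grep(searchString[i], df[, 1], ignore.case = TRUE)
    df$make[hits] <- category[i]  # no-op when hits is empty
  }
  df
}

# system.time() reports how long the call takes.
timing <- system.time(result <- categorize(bigData, cars, make))
print(timing)
```

Running the same `system.time()` call on the original nested-loop version should show whether the row-by-row assignment is where the time goes.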
This is a possibility:
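cleanCarData = carData
for(k in seq_along(cars)) {
  # rows whose car name contains the k-th search string (case-sensitive here)
  sel=grep(cars[k], as.character(cleanCarData$car))
  cleanCarData[sel,"make"] = make[k]
}
# rows that matched no search string are still NA
cleanCarData$make[is.na(cleanCarData$make)] = "OTHER"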
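For maintainability with 50 to 150 search strings, it may help to keep each pattern next to its category in a single lookup table, so that adding a rule is a one-line change. The sketch below assumes a named character vector called `lookup` and a helper called `categorize_lookup`; both names are hypothetical. As in the loops above, a later pattern overrides an earlier one when both match the same row.

```r
# Hypothetical lookup table: names are search strings, values are categories.
lookup <- c(Mustang = "Ford",
            Toyota  = "Toyota",
            M3      = "BMW",
            Chevy   = "Chevrolet")

# Works on a plain character vector, one vectorized assignment per pattern.
categorize_lookup <- function(x, lookup, default = "OTHER") {
  out <- rep(default, length(x))
  for (pattern in names(lookup)) {
    hit <- grepl(pattern, x, ignore.case = TRUE)  # logical vector of matches
    out[hit] <- lookup[[pattern]]
  }
  out
}

carData <- data.frame(
  car  = c("Mustang", "Toyota Tercel", "M3", "Datsun 240Z", "Chevy Malibu"),
  year = c("2001", "1994", "2004", "1980", "2000"),
  stringsAsFactors = FALSE
)

# Convert to a factor at the end, once every row has its category.
carData$make <- factor(categorize_lookup(carData$car, lookup))
print(carData)
```

Building the result as a plain character vector and attaching it to the data frame once avoids repeatedly modifying the data frame inside the loop.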