
Creating a new column using data from an existing column in R

I have administrative data that was entered by multiple people, and as a result many of the listings are misspelled or use inconsistent formats. For example, all Fords should be listed as 'Ford', but instead I have entries such as 'Ford', 'Ford Taurus', 'ford f150', '1980 capri classic', etc.

I am trying to create a new column that lists all of the car makes in one format (e.g. all of the above Ford listings would just come up as 'Ford'). I have searched for an answer to this, but nothing I have tried seems to work.

For example: (EquipMake is the column with the original data and New_Make is the column I want to create)

**EquipMake**                 **New_Make**
1980 Capri Classic            Ford
Camry                         Toyota
NISON                         Nissan
ford                          Ford 
Mitsubishi Eclipse Con        Mitsubishi
Cadilac  Seville              Cadillac
Dodge Caravan                 Dodge
1987 Ford                     Ford
Honda Accord                  Honda
poss / pontiac                Unknown
Oldsmobile Cutless Cie        Oldsmobile
bmw                           BMW

The code below is the closest I have come to getting it to work, but it only works for some of the entries and I can't figure out why...

mydata[grep("?|???|N/A|NA|NONE PROVIDED|NONE GIVEN|NOT PROVIDED|U/K|UNKNOWN|UNKNOWJN|UNLISTED|poss / pontiac|possible cutlass|possibly Honda|UNAVAILABLE|unlisted", mydata$EquipMake), "New_Make"] <- "Unknown"
mydata[grep("021CVG|1993|1B3BP44KLYN100171|20 FOOT|3 WHEELER|301 DORSEY|AREIE K CAR|BLACKWOOD HODGE|BLUE BIRD|BOLER|CAVCO|CLAYNOR TRAILER SALES|CMC|COMFORT|CRAFTSMAN|CUSTOM BUILT|DIESEL|FIFTH WHEEL|GARAGE TRUCK|GILLNETTER|GRUMAN|GRUHMANN|GULF STREAM|HONDAY|INTERNATIONAL|INTERNATIONAL - EAGLE|K-CAR|KING OF THE ROAD|KOBELCO|MCI|MIDAS  CHATEAU|NATIONAL|OKANAGAN|ORCA|ORD|PHMAN|PICKUP|SCOOTER|SEDAN|TORO|TRAILER|UTILITY TRAILER|WABASH|WILDERNESS|AMER|AMER. MOTORS|DAMON|DAMON CORP|FREIGHT TRAILER", mydata$EquipMake), "New_Make"] <-"Other"
mydata[grep("1979 DRUMMOND", mydata$EquipMake), "New_Make"] <- "Drummond"
mydata[grep("APPEARED TO BE A FORD|CORSAIR|FORD|FORC F-150|FORD  AEROSTAR|FORD ?|FORD 150 XLT   LIGHT B|FORD E-350|FORD EXPLORER|FORD EXPLORER?|FORD F-150|FORD F-350|FORD F150|FORD F350|FORD MUSTANG|FORD MUSTANG 07|FORD PROBE|FORD TAURUS|FORD TAURUS LICENSE 0|FORD TEMPO|FORD THUNDERBIRD|FORD TRUCK|FORD,|1980 CAPRI CLASSIC|MUSTANG|TEMPO|THUNDERBIRD|TRANSIT|WHITE 1991 FORD TEMPO", mydata$EquipMake), "New_Make"] <- "Ford"
mydata[grep("1989 HONDA ACCORD|CIVIC|HOJNDA|HONDA|HONDA   ACCORD|HONDA  (CIVIC?)|HONDA  ACCORD|HONDA  CIVIC|HONDA  CIVIC 4 DR.  1|HONDA PRELUDE|HONDA SUV|HONDA?", mydata$EquipMake), "New_Make"] <- "Honda"
mydata[grep("1992 volkswagen|2000 Jetta|jetta|passat|V.W.|volkawagen|volkawagon|VOLKS|volkswagan|volkswagen|volkswagon|Volkswagon  Golf|volkswagon passat|Volkwagon|vw|VW camperized van|vw jetta|VW Passat", mydata$EquipMake), "New_Make"] <- "Volkswagon"
mydata[grep("1993 Buick Regal|buick|buick acheiva|buick alero|buick regal|buiick riveria|alero", mydata$EquipMake), "New_Make"] <- "Buick"
mydata[grep("1994 GMC Extra can lon|G.M.C.|GMC|GMC - Sierra|GMC  Jimmy|GMC  Tracker|GMC 3500|Gmc 3500 Truck|gmc 4x4  stolen bc  77|GMC Discovery|GMC Jimmy|GMC SAFARI|gmc sierra truck|GMC van|gmc vandura|GMC Vanguard|GMC/Chevrolet|vanguard", mydata$EquipMake), "New_Make"] <- "GMC"
mydata[grep("2005 audi a4 1.8 l|audi", mydata$EquipMake), "New_Make"] <- "Audi"

(These are just the first few rows of code - there are 90 all up)

When I look at the output there should be 90 different makes in the new column but only 21 have worked (the rest are coming up as "Unknown"). In the above code the Drummond, Audi and Buick ones did not work.

Is anyone able to tell me why this isn't working? Or alternatively, point me in the direction of something that will work?

I am fairly new at using R so the simpler the explanation the better :)

Thank you!
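
On the "why" part of the question, a few properties of grep() are likely contributors, though this is a sketch under assumptions rather than a confirmed diagnosis of the real data. By default grep() treats its pattern as a regular expression, so characters such as `?`, `.`, `(` and `)` have special meanings; matching is case-sensitive; and a pattern matches anywhere inside a value, so short alternatives like `ORD` or `NA` hit far more rows than intended. Because the assignments run in sequence, a later line also overwrites whatever an earlier line assigned. The sample values below are hypothetical and only illustrate these behaviours.

```r
# Hypothetical sample values, not taken from the real data
x <- c("FORD", "Audi A4", "NATIONAL", "HOND ACCORD", "HONDA?")

# 1. Matching is case-sensitive unless ignore.case = TRUE
case_sensitive   <- grepl("audi", x)
case_insensitive <- grepl("audi", x, ignore.case = TRUE)

# 2. Patterns match substrings: "ORD" hits both "FORD" and "HOND ACCORD"
hits_ord <- x[grepl("ORD", x)]
hits_na  <- x[grepl("NA", x)]

# 3. "?" is a regex quantifier (it makes the preceding "A" optional),
#    not a literal question mark, unless fixed = TRUE is used
as_regex   <- x[grepl("HONDA?", x)]
as_literal <- x[grepl("HONDA?", x, fixed = TRUE)]

# 4. Exact, whole-value matching sidesteps all three issues
exact <- x %in% c("HONDA?", "FORD")
```

Note that `fixed = TRUE` also makes `|` literal, so it cannot be combined with the long `a|b|c` alternations used in the question; exact matching with `%in%` (or a join on a lookup table) is usually the simpler route.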

You should think about the format you want for your data. It looks like you're taking every input value and writing down the make that corresponds to it, and you then want to find all occurrences of each EquipMake value and assign the appropriate New_Make to it. As revans pointed out in the comments, there are alternative ways to approach the problem. But if you're going to take this approach, there's a far easier method than trying to grep every value. Create a tidy dataset containing two columns (EquipMake and New_Make) with one row for every value of EquipMake that you want to recode. Then join that dataset to your main data with the left_join function from dplyr (which is part of the tidyverse).

library(tidyverse) # Should be part of all data science workflows

###############################
# Generate data
grep_data <- c("?|???|N/A|NA|NONE PROVIDED|NONE GIVEN|NOT PROVIDED|U/K|UNKNOWN|UNKNOWJN|UNLISTED|poss / pontiac|possible cutlass|possibly Honda|UNAVAILABLE|unlisted",
               "021CVG|1993|1B3BP44KLYN100171|20 FOOT|3 WHEELER|301 DORSEY|AREIE K CAR|BLACKWOOD HODGE|BLUE BIRD|BOLER|CAVCO|CLAYNOR TRAILER SALES|CMC|COMFORT|CRAFTSMAN|CUSTOM BUILT|DIESEL|FIFTH WHEEL|GARAGE TRUCK|GILLNETTER|GRUMAN|GRUHMANN|GULF STREAM|HONDAY|INTERNATIONAL|INTERNATIONAL - EAGLE|K-CAR|KING OF THE ROAD|KOBELCO|MCI|MIDAS  CHATEAU|NATIONAL|OKANAGAN|ORCA|ORD|PHMAN|PICKUP|SCOOTER|SEDAN|TORO|TRAILER|UTILITY TRAILER|WABASH|WILDERNESS|AMER|AMER. MOTORS|DAMON|DAMON CORP|FREIGHT TRAILER",
               "1979 DRUMMOND",
               "APPEARED TO BE A FORD|CORSAIR|FORD|FORC F-150|FORD  AEROSTAR|FORD ?|FORD 150 XLT   LIGHT B|FORD E-350|FORD EXPLORER|FORD EXPLORER?|FORD F-150|FORD F-350|FORD F150|FORD F350|FORD MUSTANG|FORD MUSTANG 07|FORD PROBE|FORD TAURUS|FORD TAURUS LICENSE 0|FORD TEMPO|FORD THUNDERBIRD|FORD TRUCK|FORD,|1980 CAPRI CLASSIC|MUSTANG|TEMPO|THUNDERBIRD|TRANSIT|WHITE 1991 FORD TEMPO",
               "1989 HONDA ACCORD|CIVIC|HOJNDA|HONDA|HONDA   ACCORD|HONDA  (CIVIC?)|HONDA  ACCORD|HONDA  CIVIC|HONDA  CIVIC 4 DR.  1|HONDA PRELUDE|HONDA SUV|HONDA?",
               "1992 volkswagen|2000 Jetta|jetta|passat|V.W.|volkawagen|volkawagon|VOLKS|volkswagan|volkswagen|volkswagon|Volkswagon  Golf|volkswagon passat|Volkwagon|vw|VW camperized van|vw jetta|VW Passat",
               "1993 Buick Regal|buick|buick acheiva|buick alero|buick regal|buiick riveria|alero",
               "1994 GMC Extra can lon|G.M.C.|GMC|GMC - Sierra|GMC  Jimmy|GMC  Tracker|GMC 3500|Gmc 3500 Truck|gmc 4x4  stolen bc  77|GMC Discovery|GMC Jimmy|GMC SAFARI|gmc sierra truck|GMC van|gmc vandura|GMC Vanguard|GMC/Chevrolet|vanguard",
               "2005 audi a4 1.8 l|audi")

make_data <- c("Unknown", "Other", "Drummond", "Ford", "Honda", "Volkswagen", "Buick", "GMC", "Audi")

raw_reference <- tibble(grep_data, make_data)

make_replacement_table <- function(namestring) {
  strsplit(namestring[1],
    split = "|",
    fixed = TRUE
  ) %>% unlist %>%
    tibble(., namestring[2]) %>%
    set_names(c("EquipMake", "New_Make"))
}

# Create the lookup table containing original and replacement values
# You could create the table in Excel and import with readr::read_csv()
reference_table <- apply(raw_reference, 1, make_replacement_table) %>%
  do.call(rbind.data.frame, .)

# Generate an example dataset by sampling EquipMake values from the lookup table
# (this has to run after reference_table has been created)
mydata <- tibble(
  EquipMake = sample(reference_table$EquipMake, size = 1000, replace = TRUE)
)

###############################
# The answer to your question

# Now join reference_table against your raw data
# Any values of EquipMake you haven't coded will be NA
mydata <- mydata %>%
  left_join(reference_table, by = "EquipMake")
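
As a possible follow-up to the join, the sketch below shows one way to list the values that still come back as NA and to shrink the lookup table by normalising case and whitespace first. The `lookup` and `raw` tables and the `key` column are hypothetical names made up for illustration; only dplyr and base R functions are used.

```r
library(dplyr)

# Hypothetical lookup table and raw data, for illustration only
lookup <- tibble(
  EquipMake = c("FORD", "FORD TAURUS", "AUDI", "CAMRY"),
  New_Make  = c("Ford", "Ford", "Audi", "Toyota")
)
raw <- tibble(
  EquipMake = c("ford", " Ford  Taurus", "audi", "Camry", "mystery van")
)

# Normalise case and whitespace so one lookup row covers several spellings
cleaned <- raw %>%
  mutate(key = toupper(trimws(gsub("\\s+", " ", EquipMake))))

# Join on the normalised key; unmatched rows get NA in New_Make
recoded <- cleaned %>%
  left_join(lookup, by = c("key" = "EquipMake"))

# Values that still need to be added to the lookup table
uncoded <- recoded %>%
  filter(is.na(New_Make)) %>%
  distinct(EquipMake)

# Optionally label whatever is left as "Unknown"
recoded <- recoded %>%
  mutate(New_Make = coalesce(New_Make, "Unknown"))
```

Reviewing `uncoded` after each run gives a short list of spellings to add to the lookup table, which tends to be easier to maintain than one long pattern per make.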
