I have a dataframe with over 20k obs. One of the columns is "city names" (df$city). There are over 600 unique city names. Some of them are misspelled.
Example of my dataframe:
> df$city
[1] "BOSTN" "LOS ANGELOS" "NYC" "CHICAGOO"
[2] "SEATTLE" "BOSTON" "NEW YORK CITY"
I have a csv file I created that has a list of all the misspelled city names and what the correct name should be.
> head(city)
city city_incorrect
1 BOSTON BOSTN
2 LOS ANGELES LOS ANGELOS
3 NEW YORK CITY NYC
4 CHICAGO CHICAGOO
Ideally I would write code that replaces values in df$city based on the "city.csv" file.
Note: I originally posted this question and someone suggested I use merge. I don't think this is the most efficient way to solve my problem, because I would also have to include the 600 correctly spelled cities in my "city.csv" file, or I'd need an additional step that combines the two columns from the merged dataframe. So I think it's probably easier to just REPLACE values in df$city based on "city.csv".
EDIT: Here's a more detailed look at my dataframe
> df[1:5]
id owner city state
1 AAAAA BOSTN MA
2 BBBBB LOS ANGELOS CA
3 CCCCC NYC NY
4 DDDDD CHICAGOO IL
5 EEEEE BOSTON MA
6 FFFFF SEATTLE WA
7 GGGGG NEW YORK CITY NY
8 HHHHH LOS ANGELES CA
If I use merge or cbind won't it just create another column at the end of my dataframe like this:
> merge()
id owner city state city_correct
1 AAAAA BOSTN MA BOSTON
2 BBBBB LOS ANGELOS CA LOS ANGELES
3 CCCCC NYC NY NEW YORK CITY
4 DDDDD CHICAGOO IL CHICAGO
5 EEEEE BOSTON MA
6 FFFFF SEATTLE WA
7 GGGGG NEW YORK CITY NY
8 HHHHH LOS ANGELES CA
So the cities with misspelling will be corrected, but the cities that are spelled correctly will be left out. What I want in the end is one column that has all the corrected city names.
One approach with base::merge() is to include rows in the lookup table that have the correct value of city, and merge that table with the original data. We'll call the "correct" city names correctedCity, and merge as follows:
cityText <- "id,owner,city,state
1,AAAAA,BOSTN,MA
2,BBBBB,LOS ANGELOS,CA
3,CCCCC,NYC,NY
4,DDDDD,CHICAGOO,IL
5,EEEEE,BOSTON,MA
6,FFFFF,SEATTLE,WA
7,GGGGG,NEW YORK CITY,NY
8,HHHHH,LOS ANGELES,CA"
cities <- read.csv(text = cityText, header = TRUE, stringsAsFactors = FALSE)
# first, find all the distinct versions of city
library(sqldf)
distinctCities <- sqldf("select city, count(*) as count from cities group by city")
# create lookup table, and include rows for items that are already correct
tableText <- "city,correctedCity
BOSTN,BOSTON
BOSTON,BOSTON
CHICAGOO,CHICAGO
LOS ANGELES,LOS ANGELES
LOS ANGELOS,LOS ANGELES
NEW YORK CITY,NEW YORK CITY
NYC,NEW YORK CITY
SEATTLE,SEATTLE"
cityTable <- read.csv(text = tableText, header = TRUE, stringsAsFactors = FALSE)
corrected <- merge(cities, cityTable, by = "city")
corrected
...and the output:
> corrected
city id owner state correctedCity
1 BOSTN 1 AAAAA MA BOSTON
2 BOSTON 5 EEEEE MA BOSTON
3 CHICAGOO 4 DDDDD IL CHICAGO
4 LOS ANGELES 8 HHHHH CA LOS ANGELES
5 LOS ANGELOS 2 BBBBB CA LOS ANGELES
6 NEW YORK CITY 7 GGGGG NY NEW YORK CITY
7 NYC 3 CCCCC NY NEW YORK CITY
8 SEATTLE 6 FFFFF WA SEATTLE
>
At this point one can drop the original values and keep the corrected version.
# rename & keep corrected version
library(dplyr)
corrected %>% select(-city) %>% rename(city = correctedCity)
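The same cleanup can be sketched in base R without dplyr. Note that merge() sorts rows by the key and moves the key column to the front, so this sketch also restores the original id order and column order. The small `corrected` data frame below is a hypothetical stand-in for the merged result, trimmed to three rows.

```r
# Toy stand-in for the merged `corrected` data frame (assumed values, 3 rows only).
corrected <- data.frame(
  city = c("BOSTN", "BOSTON", "NYC"),
  id = c(1, 5, 3),
  owner = c("AAAAA", "EEEEE", "CCCCC"),
  state = c("MA", "MA", "NY"),
  correctedCity = c("BOSTON", "BOSTON", "NEW YORK CITY"),
  stringsAsFactors = FALSE
)

# Overwrite the original column with the corrected values, then drop the helper column.
corrected$city <- corrected$correctedCity
corrected$correctedCity <- NULL

# Restore the original row order (by id) and the original column order.
corrected <- corrected[order(corrected$id), c("id", "owner", "city", "state")]
rownames(corrected) <- NULL
corrected
```

Ordering by id only works here because id reflects the original row order; with real data you may want to add an explicit row-number column before merging.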
An alternative, as noted in the comments to the OP, would be to create a lookup table that contains rows only for the misspelled city names. In this case we would use the argument all.x = TRUE in merge() to keep all rows from the main data frame, and assign the non-missing values of correctedCity to city.
tableText <- "city,correctedCity
BOSTN,BOSTON
CHICAGOO,CHICAGO
LOS ANGELOS,LOS ANGELES
NYC,NEW YORK CITY"
cityTable <- read.csv(text = tableText, header = TRUE, stringsAsFactors = FALSE)
corrected <- merge(cities, cityTable, by = "city", all.x = TRUE)
corrected$city[!is.na(corrected$correctedCity)] <- corrected$correctedCity[!is.na(corrected$correctedCity)]
corrected
...and the output:
> corrected
city id owner state correctedCity
1 BOSTON 1 AAAAA MA BOSTON
2 BOSTON 5 EEEEE MA <NA>
3 CHICAGO 4 DDDDD IL CHICAGO
4 LOS ANGELES 8 HHHHH CA <NA>
5 LOS ANGELES 2 BBBBB CA LOS ANGELES
6 NEW YORK CITY 7 GGGGG NY <NA>
7 NEW YORK CITY 3 CCCCC NY NEW YORK CITY
8 SEATTLE 6 FFFFF WA <NA>
>
At this point, correctedCity can be dropped from the data frame.
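If the goal is only to replace values in place, the merge can be skipped entirely. The following is a minimal sketch, assuming the same `cities` and `cityTable` layout as above (rebuilt here so the block runs on its own); it uses match() to find each city in the lookup table and overwrites only the rows that have a match, which leaves row and column order untouched.

```r
# Rebuild the example data so the sketch is self-contained.
cities <- data.frame(
  id = 1:8,
  owner = c("AAAAA", "BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFF", "GGGGG", "HHHHH"),
  city = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO",
           "BOSTON", "SEATTLE", "NEW YORK CITY", "LOS ANGELES"),
  state = c("MA", "CA", "NY", "IL", "MA", "WA", "NY", "CA"),
  stringsAsFactors = FALSE
)
cityTable <- data.frame(
  city = c("BOSTN", "CHICAGOO", "LOS ANGELOS", "NYC"),
  correctedCity = c("BOSTON", "CHICAGO", "LOS ANGELES", "NEW YORK CITY"),
  stringsAsFactors = FALSE
)

# Position of each city in the lookup table; NA where the name is not a known misspelling.
idx <- match(cities$city, cityTable$city)
hit <- !is.na(idx)

# Overwrite only the matched rows; everything else keeps its original value.
cities$city[hit] <- cityTable$correctedCity[idx[hit]]
cities
```

Because nothing is joined, no helper column has to be dropped afterwards and duplicate keys in the lookup table cannot multiply rows (match() simply uses the first occurrence).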
It appears to me that what you're trying to do is match and replace incorrect city names in one dataframe by correct city names in another dataframe. If this is correct then this dplyr solution should work.
Data :
Dataframe with pairs of correct and incorrect city names:
city <- data.frame(
city_correct = c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO"),
city_incorrect = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO"), stringsAsFactors = FALSE)
Dataframe with mix of correct and incorrect city names:
set.seed(123)
df <- data.frame(town = sample(c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO","BOSTN",
"LOS ANGELOS", "NYC", "CHICAGOO"), 20, replace = TRUE), stringsAsFactors = FALSE)
Solution :
library(dplyr)
df <- left_join(df, city, by = c("town" = "city_incorrect"))
df$town_correct <- ifelse(is.na(df$city_correct), df$town, df$city_correct)
df$city_correct <- NULL
EDIT:
Another, base R, solution is this:
# replace known misspellings; keep the original value otherwise
df$town_correct <- ifelse(df$town %in% city$city_incorrect,
city$city_correct[match(df$town, city$city_incorrect)],
df$town)
Result :
df
town town_correct
1 NEW YORK CITY NEW YORK CITY
2 NYC NEW YORK CITY
3 CHICAGO CHICAGO
4 CHICAGOO CHICAGO
5 CHICAGOO CHICAGO
6 BOSTON BOSTON
7 BOSTN BOSTON
8 CHICAGOO CHICAGO
9 BOSTN BOSTON
10 CHICAGO CHICAGO
11 CHICAGOO CHICAGO
12 CHICAGO CHICAGO
13 LOS ANGELOS LOS ANGELES
14 BOSTN BOSTON
15 BOSTON BOSTON
16 CHICAGOO CHICAGO
17 LOS ANGELES LOS ANGELES
18 BOSTON BOSTON
19 NEW YORK CITY NEW YORK CITY
20 CHICAGOO CHICAGO
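To tie this back to the layout in the question (a df$city column and a "city.csv" with columns city and city_incorrect), here is a minimal base R sketch using a named vector as the lookup. The temporary file stands in for the asker's real "city.csv", and `fix` and `replacement` are names made up for this illustration.

```r
# Stand-in for the asker's "city.csv" (columns as shown by head(city) in the question).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "city,city_incorrect",
  "BOSTON,BOSTN",
  "LOS ANGELES,LOS ANGELOS",
  "NEW YORK CITY,NYC",
  "CHICAGO,CHICAGOO"
), csv_path)
city <- read.csv(csv_path, stringsAsFactors = FALSE)

# Example data with the same values as df$city in the question.
df <- data.frame(
  city = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO",
           "SEATTLE", "BOSTON", "NEW YORK CITY"),
  stringsAsFactors = FALSE
)

# Named vector: names are the misspellings, values are the corrections.
fix <- setNames(city$city, city$city_incorrect)

# Look up every value; names that are not in the table come back as NA
# and keep their original spelling.
replacement <- unname(fix[df$city])
df$city <- ifelse(is.na(replacement), df$city, replacement)
df$city
```

This overwrites df$city directly, so the result is a single column of corrected names and cities missing from the lookup file (such as SEATTLE) pass through unchanged.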