Replace values in one column based on another dataframe in R

I have a dataframe with over 20k obs. One of the columns is "city names" (df$city). There are over 600 unique city names. Some of them are misspelled.

Example of my dataframe:

> df$city
[1] "BOSTN" "LOS ANGELOS" "NYC" "CHICAGOO"
[5] "SEATTLE" "BOSTON" "NEW YORK CITY"

I have a csv file I created that has a list of all the misspelled city names and what the correct name should be.

> head(city)
           city    city_incorrect
1 BOSTON                    BOSTN
2 LOS ANGELES         LOS ANGELOS
3 NEW YORK CITY               NYC
4 CHICAGO                CHICAGOO

Ideally I would write code that replaces values in df$city based on the "city.csv" file.

Note: I originally posted this question and someone suggested I use merge. I don't think this is the most efficient way to solve my problem, because I would also have to include the 600 correctly spelled cities in my "city.csv" file, or add an extra step that combines the two columns from the merged dataframe. So I think it's probably easier to just replace values in df$city based on "city.csv".

EDIT: Here's a more detailed look at my dataframe

> df[1:8, ]
id   owner   city            state
1    AAAAA   BOSTN              MA
2    BBBBB   LOS ANGELOS        CA
3    CCCCC   NYC                NY
4    DDDDD   CHICAGOO           IL
5    EEEEE   BOSTON             MA
6    FFFFF   SEATTLE            WA
7    GGGGG   NEW YORK CITY      NY
8    HHHHH   LOS ANGELES        CA

If I use merge or cbind, won't it just create another column at the end of my dataframe, like this:

> merge()
id   owner   city            state     city_correct
1    AAAAA   BOSTN              MA           BOSTON
2    BBBBB   LOS ANGELOS        CA      LOS ANGELES
3    CCCCC   NYC                NY    NEW YORK CITY
4    DDDDD   CHICAGOO           IL          CHICAGO
5    EEEEE   BOSTON             MA
6    FFFFF   SEATTLE            WA
7    GGGGG   NEW YORK CITY      NY
8    HHHHH   LOS ANGELES        CA

So the misspelled cities will be corrected, but the cities that are already spelled correctly will be left blank in the new column. What I want in the end is one column that has all the corrected city names.

One approach with base::merge() is to include rows in the lookup table for city names that are already correct, and merge that table with the original data. We'll call the corrected city names correctedCity, and merge as follows:

cityText <- "id,owner,city,state
1,AAAAA,BOSTN,MA
2,BBBBB,LOS ANGELOS,CA
3,CCCCC,NYC,NY
4,DDDDD,CHICAGOO,IL
5,EEEEE,BOSTON,MA
6,FFFFF,SEATTLE,WA
7,GGGGG,NEW YORK CITY,NY
8,HHHHH,LOS ANGELES,CA"

cities <- read.csv(text = cityText, header = TRUE, stringsAsFactors = FALSE)

# first, find all the distinct versions of city
library(sqldf)
distinctCities <- sqldf("select city, count(*) as count from cities group by city")

# create lookup table, and include rows for items that are already correct 
tableText <- "city,correctedCity
BOSTN,BOSTON
BOSTON,BOSTON
CHICAGOO,CHICAGO
LOS ANGELES,LOS ANGELES
LOS ANGELOS,LOS ANGELES
NEW YORK CITY,NEW YORK CITY
NYC,NEW YORK CITY
SEATTLE,SEATTLE"

cityTable <- read.csv(text = tableText, header = TRUE, stringsAsFactors = FALSE)
corrected <- merge(cities, cityTable, by = "city")
corrected

...and the output:

> corrected
           city id owner state correctedCity
1         BOSTN  1 AAAAA    MA        BOSTON
2        BOSTON  5 EEEEE    MA        BOSTON
3      CHICAGOO  4 DDDDD    IL       CHICAGO
4   LOS ANGELES  8 HHHHH    CA   LOS ANGELES
5   LOS ANGELOS  2 BBBBB    CA   LOS ANGELES
6 NEW YORK CITY  7 GGGGG    NY NEW YORK CITY
7           NYC  3 CCCCC    NY NEW YORK CITY
8       SEATTLE  6 FFFFF    WA       SEATTLE
>

At this point, one can drop the original values and keep the corrected version.

# rename & keep corrected version
library(dplyr)
corrected %>% select(-city) %>% rename(city = correctedCity) 
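
If listing every correctly spelled city by hand is impractical (the question mentions over 600 unique names), the identity rows can probably be generated from the data itself. The following is a minimal sketch under that assumption; the names misspelled, already_ok, and fullTable are illustrative and do not appear in the original answer.

```r
# Sample data mirroring the question (owner and state omitted for brevity)
cities <- data.frame(
  id = 1:8,
  city = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO",
           "BOSTON", "SEATTLE", "NEW YORK CITY", "LOS ANGELES"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table that lists only the misspelled names
misspelled <- data.frame(
  city = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO"),
  correctedCity = c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO"),
  stringsAsFactors = FALSE
)

# Every distinct city that has no entry in the lookup is assumed to be correct
already_ok <- setdiff(unique(cities$city), misspelled$city)

# Append identity rows (city maps to itself) so an inner merge keeps all rows
fullTable <- rbind(
  misspelled,
  data.frame(city = already_ok, correctedCity = already_ok,
             stringsAsFactors = FALSE)
)

corrected <- merge(cities, fullTable, by = "city")
```

This treats any name missing from the lookup as correct, so a misspelling that was never added to the lookup passes through unchanged.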

An alternative, as noted in the comments on the question, is to create a lookup table that contains rows only for the misspelled city names. In this case we use the argument all.x = TRUE in merge() to keep all rows from the main data frame, and then assign the non-missing values of correctedCity to city.

tableText <- "city,correctedCity
BOSTN,BOSTON
CHICAGOO,CHICAGO
LOS ANGELOS,LOS ANGELES
NYC,NEW YORK CITY"

cityTable <- read.csv(text = tableText, header = TRUE, stringsAsFactors = FALSE)
corrected <- merge(cities, cityTable, by = "city", all.x = TRUE)
corrected$city[!is.na(corrected$correctedCity)] <- corrected$correctedCity[!is.na(corrected$correctedCity)]
corrected

...and the output:

> corrected
           city id owner state correctedCity
1        BOSTON  1 AAAAA    MA        BOSTON
2        BOSTON  5 EEEEE    MA          <NA>
3       CHICAGO  4 DDDDD    IL       CHICAGO
4   LOS ANGELES  8 HHHHH    CA          <NA>
5   LOS ANGELES  2 BBBBB    CA   LOS ANGELES
6 NEW YORK CITY  7 GGGGG    NY          <NA>
7 NEW YORK CITY  3 CCCCC    NY NEW YORK CITY
8       SEATTLE  6 FFFFF    WA          <NA>
> 

At this point, correctedCity can be dropped from the data frame.
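
Note that merge() sorts the result by the key column by default and moves that column to the front, so the rows and columns no longer follow the original layout. The sketch below shows one way to drop the helper column and restore the original layout. It assumes the id column reflects the original row order, which holds for the sample data but may not hold for the real 20k-row data frame.

```r
# Sample data mirroring the question
cities <- data.frame(
  id = 1:8,
  owner = c("AAAAA", "BBBBB", "CCCCC", "DDDDD",
            "EEEEE", "FFFFF", "GGGGG", "HHHHH"),
  city = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO",
           "BOSTON", "SEATTLE", "NEW YORK CITY", "LOS ANGELES"),
  state = c("MA", "CA", "NY", "IL", "MA", "WA", "NY", "CA"),
  stringsAsFactors = FALSE
)

# Lookup table with rows only for the misspelled names
cityTable <- data.frame(
  city = c("BOSTN", "CHICAGOO", "LOS ANGELOS", "NYC"),
  correctedCity = c("BOSTON", "CHICAGO", "LOS ANGELES", "NEW YORK CITY"),
  stringsAsFactors = FALSE
)

corrected <- merge(cities, cityTable, by = "city", all.x = TRUE)

# Overwrite city only where the lookup supplied a correction
fix <- !is.na(corrected$correctedCity)
corrected$city[fix] <- corrected$correctedCity[fix]

# Drop the helper column, then restore the original row and column order
corrected$correctedCity <- NULL
corrected <- corrected[order(corrected$id), names(cities)]
rownames(corrected) <- NULL
```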

It appears to me that what you're trying to do is match and replace incorrect city names in one dataframe by correct city names in another dataframe. If this is correct then this dplyr solution should work.

Data:

Dataframe with pairs of correct and incorrect city names:

city <- data.frame(
  city_correct = c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO"),
  city_incorrect = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO"), stringsAsFactors = FALSE)

Dataframe with mix of correct and incorrect city names:

set.seed(123)
df <- data.frame(town = sample(c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO", "BOSTN",
                                 "LOS ANGELOS", "NYC", "CHICAGOO"), 20, replace = TRUE), stringsAsFactors = FALSE)

Solution:

library(dplyr)
df <- left_join(df, city, by = c("town" = "city_incorrect"))
df$town_correct <- ifelse(is.na(df$city_correct), df$town, df$city_correct)
df$city_correct <- NULL
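
The same join can also be written as a single pipeline using dplyr's coalesce(), which returns the first non-missing value at each position. This is a sketch of an equivalent formulation, not the answer's original code; the four-row df is a small stand-in used for illustration, and SEATTLE is included to show a name that is absent from the lookup.

```r
library(dplyr)

# Lookup of correct/incorrect pairs, as in the answer
city <- data.frame(
  city_correct = c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO"),
  city_incorrect = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO"),
  stringsAsFactors = FALSE
)

# Small illustrative data frame (stand-in for the sampled one)
df <- data.frame(town = c("BOSTN", "SEATTLE", "NYC", "BOSTON"),
                 stringsAsFactors = FALSE)

df <- df %>%
  left_join(city, by = c("town" = "city_incorrect")) %>%
  # Take the correction where the join found one, else keep the original name
  mutate(town_correct = coalesce(city_correct, town)) %>%
  select(-city_correct)
```

Names that never appear in city_incorrect get NA from the join, so coalesce() falls back to the original value for them.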

EDIT:

Another solution, in base R, is this:

# use the lookup where the name is a known misspelling, else keep the name as is
df$town_correct <- ifelse(df$town %in% city$city_incorrect,
                          city$city_correct[match(df$town, city$city_incorrect)],
                          df$town)

Result:

df
            town  town_correct
1  NEW YORK CITY NEW YORK CITY
2            NYC NEW YORK CITY
3        CHICAGO       CHICAGO
4       CHICAGOO       CHICAGO
5       CHICAGOO       CHICAGO
6         BOSTON        BOSTON
7          BOSTN        BOSTON
8       CHICAGOO       CHICAGO
9          BOSTN        BOSTON
10       CHICAGO       CHICAGO
11      CHICAGOO       CHICAGO
12       CHICAGO       CHICAGO
13   LOS ANGELOS   LOS ANGELES
14         BOSTN        BOSTON
15        BOSTON        BOSTON
16      CHICAGOO       CHICAGO
17   LOS ANGELES   LOS ANGELES
18        BOSTON        BOSTON
19 NEW YORK CITY NEW YORK CITY
20      CHICAGOO       CHICAGO
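
A more compact base R variant is to treat the lookup as a named character vector, so that indexing by name does the matching. The sketch below is an alternative formulation rather than part of the original answers; lookup and fixed are illustrative names, and the five-row df is a stand-in for the real data.

```r
city <- data.frame(
  city_correct = c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO"),
  city_incorrect = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO"),
  stringsAsFactors = FALSE
)

df <- data.frame(town = c("BOSTN", "SEATTLE", "NYC", "BOSTON", "CHICAGOO"),
                 stringsAsFactors = FALSE)

# Named vector: names are the misspellings, values are the corrections
lookup <- setNames(city$city_correct, city$city_incorrect)

# Indexing by name yields NA for any town that is not a known misspelling
fixed <- unname(lookup[df$town])

# Keep the original name wherever no correction was found
df$town_correct <- ifelse(is.na(fixed), df$town, fixed)
```

Unlike merge(), this leaves the row order of df untouched and adds no intermediate columns.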
