简体   繁体   中英

R Studio: Match first n characters between two columns, and fill in value from another column

I have a dataframe "city_table" that looks like this:

+---+---------------------+
|   | city                |
+---+---------------------+
| 1 | Chicago-2234dxsw    |
+---+---------------------+
| 2 | Chicago,IL          |
+---+---------------------+
| 3 | Chicago             |
+---+---------------------+
| 4 | Chicago - 124421xsd |
+---+---------------------+
| 5 | Chicago_2133xx      |
+---+---------------------+
| 6 | Atlanta- 1234xx     |
+---+---------------------+
| 7 | Atlanta, GA         |
+---+---------------------+
| 8 | Atlanta - 123456T   |
+---+---------------------+

I have another city code lookup table "city_lookup" that looks like this:

+---+--------------+-----------+
|   | city_name    | city_code |
+---+--------------+-----------+
| 1 | Chicago, IL  | 001       |
+---+--------------+-----------+
| 2 | Atlanta, GA  | 002       |
+---+--------------+-----------+

As you can see, city names in "city" are messy and formatted differently, while as the city names in "city_code" are following unified format (city,STATE).

I would like the final table that, through matching first n characters (let's day, n=7) between city_table$city vs. city_lookup$city_name , return me the city code properly, sth like this:

+---+---------------------+-----------+
|   | city_name           | city_code |
+---+---------------------+-----------+
| 1 | Chicago-2234dxsw    | 001       |
+---+---------------------+-----------+
| 2 | Chicago,IL          | 001       |
+---+---------------------+-----------+
| 3 | Chicago             | 001       |
+---+---------------------+-----------+
| 4 | Chicago - 124421xsd | 001       |
+---+---------------------+-----------+
| 5 | Chicago_2133xx      | 001       |
+---+---------------------+-----------+
| 6 | Atlanta- 1234xx     | 002       |
+---+---------------------+-----------+
| 7 | Atlanta, GA         | 002       |
+---+---------------------+-----------+
| 8 | Atlanta - 123456T   | 002       |
+---+---------------------+-----------+

I am doing this in R, preferably using tidyverse/dplyr. Thanks so much for your help!

Even better, as long as the characters after the full city names are always non-letters, you can match the entire city name as so:

city_table <- tibble(city = c("Chicago-2234dxsw", "Chicago,IL", "Atlanta - 123456T"))
city_lookup <- tibble(city_name = c("Chicago, IL", "Atlanta, GA"),
                      city_code = c("001", "002"))


city_table %>%
  mutate(city_clean  = gsub("^([a-zA-Z]*).*", "\\1", city)) %>%
  left_join(city_lookup %>%
              mutate(city_clean  = gsub("^([a-zA-Z]*).*", "\\1", city_name, perl = T)),
            by = "city_clean") %>%
  select(-city_clean, -city_name)


  city              city_code
  <chr>             <chr>    
1 Chicago-2234dxsw  001      
2 Chicago,IL        001      
3 Atlanta - 123456T 002 

We can create columns with substring (as the OP asked in the question) and then do a regex_left_join

library(dplyr)
library(fuzzyjoin)
city_table %>%
   mutate(city_sub = substring(city, 1, 7)) %>%
   regex_left_join(city_lookup %>%
                     mutate(city_sub = substring(city_name, 1, 7)), 
             by = 'city_sub')  %>%
   select(city_name = city, city_code)

-output

#             city_name city_code
#1    Chicago-2234dxsw       001
#2          Chicago,IL       001
#3             Chicago       001
#4 Chicago - 124421xsd       001
#5      Chicago_2133xx       001
#6     Atlanta- 1234xx       002
#7         Atlanta, GA       002
#8   Atlanta - 123456T       002

data

city_table <- structure(list(city = c("Chicago-2234dxsw", "Chicago,IL", "Chicago", 
"Chicago - 124421xsd", "Chicago_2133xx", "Atlanta- 1234xx", "Atlanta, GA", 
"Atlanta - 123456T")), class = "data.frame", row.names = c(NA, 
-8L))

city_lookup <- structure(list(city_name = c("Chicago, IL", "Atlanta, GA"), 
city_code = c("001", 
"002")), class = "data.frame", row.names = c(NA, -2L))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM