
R Studio: Match first n characters between two columns, and fill in value from another column

I have a dataframe "city_table" that looks like this:

+---+---------------------+
|   | city                |
+---+---------------------+
| 1 | Chicago-2234dxsw    |
+---+---------------------+
| 2 | Chicago,IL          |
+---+---------------------+
| 3 | Chicago             |
+---+---------------------+
| 4 | Chicago - 124421xsd |
+---+---------------------+
| 5 | Chicago_2133xx      |
+---+---------------------+
| 6 | Atlanta- 1234xx     |
+---+---------------------+
| 7 | Atlanta, GA         |
+---+---------------------+
| 8 | Atlanta - 123456T   |
+---+---------------------+

I have another lookup table of city codes, "city_lookup", as shown below:

+---+--------------+-----------+
|   | city_name    | city_code |
+---+--------------+-----------+
| 1 | Chicago, IL  | 001       |
+---+--------------+-----------+
| 2 | Atlanta, GA  | 002       |
+---+--------------+-----------+

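As you can see, the city names in city_table$city are messy and inconsistently formatted, whereas the city names in city_lookup$city_name follow a uniform format (City, STATE).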

I want a final table that returns the correct city code for each row by matching the first n characters (let's say n = 7) between city_table$city and city_lookup$city_name, like this:

+---+---------------------+-----------+
|   | city_name           | city_code |
+---+---------------------+-----------+
| 1 | Chicago-2234dxsw    | 001       |
+---+---------------------+-----------+
| 2 | Chicago,IL          | 001       |
+---+---------------------+-----------+
| 3 | Chicago             | 001       |
+---+---------------------+-----------+
| 4 | Chicago - 124421xsd | 001       |
+---+---------------------+-----------+
| 5 | Chicago_2133xx      | 001       |
+---+---------------------+-----------+
| 6 | Atlanta- 1234xx     | 002       |
+---+---------------------+-----------+
| 7 | Atlanta, GA         | 002       |
+---+---------------------+-----------+
| 8 | Atlanta - 123456T   | 002       |
+---+---------------------+-----------+

I'm doing this in R, preferably with tidyverse/dplyr. Thank you very much for your help!

Better still, as long as the character right after the full city name is always a non-letter, you can match on the whole city name:

library(dplyr)

city_table <- tibble(city = c("Chicago-2234dxsw", "Chicago,IL", "Atlanta - 123456T"))
city_lookup <- tibble(city_name = c("Chicago, IL", "Atlanta, GA"),
                      city_code = c("001", "002"))

# Keep only the leading run of letters in each table and join on it
city_table %>%
  mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city)) %>%
  left_join(city_lookup %>%
              mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city_name)),
            by = "city_clean") %>%
  # Drop the helper key and the lookup's city_name
  select(-city_clean, -city_name)


  city              city_code
  <chr>             <chr>    
1 Chicago-2234dxsw  001      
2 Chicago,IL        001      
3 Atlanta - 123456T 002 
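
To see what the join key looks like, here is a small base R sketch of the extraction step on its own. The "New York, NY" entry and the second pattern are not from the question; they are assumptions added only to show that the letters-only pattern stops at the first space, and one possible way to allow single spaces inside a name.

```r
# Sample strings: the first two come from the question,
# "New York, NY" is a hypothetical multi-word city added for illustration.
cities <- c("Chicago-2234dxsw", "Atlanta - 123456T", "New York, NY")

# Pattern used in the answer above: keep only the leading run of letters.
simple <- sub("^([a-zA-Z]*).*", "\\1", cities)

# Hypothetical variant: also accept a single space followed by more letters.
# The non-capturing group (?: ) needs perl = TRUE.
multi <- sub("^([A-Za-z]+(?: [A-Za-z]+)*).*", "\\1", cities, perl = TRUE)

print(simple)
print(multi)
```

With the letters-only pattern the multi-word name is cut down to "New", so that pattern is only safe when every city name is a single word. The variant would in turn keep too much for an entry such as "Chicago IL", so it is worth checking the keys against your own data before joining.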

We can create a column with substring (as the OP asked in the question) and then do a regex_left_join:

library(dplyr)
library(fuzzyjoin)

city_table %>%
  # First 7 characters of the messy city string
  mutate(city_sub = substring(city, 1, 7)) %>%
  # The lookup's 7-character prefixes are treated as regex patterns
  regex_left_join(city_lookup %>%
                    mutate(city_sub = substring(city_name, 1, 7)),
                  by = 'city_sub') %>%
  select(city_name = city, city_code)

Output

#             city_name city_code
#1    Chicago-2234dxsw       001
#2          Chicago,IL       001
#3             Chicago       001
#4 Chicago - 124421xsd       001
#5      Chicago_2133xx       001
#6     Atlanta- 1234xx       002
#7         Atlanta, GA       002
#8   Atlanta - 123456T       002
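
Since both keys here are plain fixed-length prefixes, an exact join on the prefix should give the same result without fuzzyjoin. The following is a minimal sketch under that assumption, using only dplyr; the variable names n and result are illustrative, and the data is a shortened copy of the question's tables.

```r
library(dplyr)

# Shortened copy of the question's tables
city_table <- tibble(city = c("Chicago-2234dxsw", "Chicago,IL", "Chicago",
                              "Atlanta- 1234xx", "Atlanta, GA"))
city_lookup <- tibble(city_name = c("Chicago, IL", "Atlanta, GA"),
                      city_code = c("001", "002"))

n <- 7  # prefix length suggested in the question

result <- city_table %>%
  # Join key: first n characters of the messy city string
  mutate(city_sub = substr(city, 1, n)) %>%
  # Same key on the lookup side; keep only the key and the code
  left_join(city_lookup %>%
              mutate(city_sub = substr(city_name, 1, n)) %>%
              select(city_sub, city_code),
            by = "city_sub") %>%
  select(city_name = city, city_code)

print(result)
```

A fixed n works for this data because "Chicago" and "Atlanta" both happen to be exactly 7 letters long. For shorter names the prefix would include separators such as "-" or ",", and rows with different separators would then fail to match.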

Data

city_table <- structure(list(city = c("Chicago-2234dxsw", "Chicago,IL", "Chicago", 
"Chicago - 124421xsd", "Chicago_2133xx", "Atlanta- 1234xx", "Atlanta, GA", 
"Atlanta - 123456T")), class = "data.frame", row.names = c(NA, 
-8L))

city_lookup <- structure(list(city_name = c("Chicago, IL", "Atlanta, GA"), 
city_code = c("001", 
"002")), class = "data.frame", row.names = c(NA, -2L))
