
How to merge two data frames with specific string match in columns in R?

I have two data frames, data1 and data2, with the following contents:

dput(data1)

structure(list(ProfName = c("Hua (Christine) Xin", "Dereck Barr-Pulliam", 
"Lisa M. Blum", "Russell  Williamson", "William D. Stout", "Michael F. Wade", 
"Sheila A.  Johnston", "Julie Huang", "Alan Attaway", "Alan Levitan", 
"Benjamin P. Foster", "Carolyn M.  Callahan"), Title = c(" PhD", 
" PhD", " LLM", " PhD", " PhD", " CPA", " MS", " PhD", " PhD", 
" PhD", " PhD", " PhD"), Profession = c("Assistant Professor", 
"Assistant Professor", "Instructor", "Assistant Professor", "Associate Professor and Director", 
"Instructor", "Instructor", "Associate Professor", "Professor", 
"Professor", "Professor", "Brown-Forman Professor of Accountancy"
)), row.names = c(8L, 18L, 25L, 36L, 49L, 50L, 56L, 69L, 71L, 
82L, 88L, 89L), class = "data.frame")


dput(data2)

structure(list(ProfName = c("Blandford, K     ", "Okafor, A     ", 
"Johnston, S     ", "Rolen, R     ", "Attaway, A     ", "Xin, H     ", 
"Huang, Y     ", "Stout, W     ", "Williamson, R     ", "Callahan, C     ", 
"Foster, B     ", "Blum, L     ", "Levitan, A     ", "Barr-Pulliam, D     ", 
"Wade, M     ")), row.names = c(NA, -15L), class = "data.frame")


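I want to merge the two data frames, but the names are formatted differently. Only part of each string in data1's ProfName column matches data2's ProfName column. The data should be merged, and where a name has no matching information, the fields should be left empty. If a row has no information in the Title and Profession columns, the ProfNameNew column should carry the same name.

I tried using merge, but it does not give the desired output.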

merge(data1, data2, by="ProfName", all.x=TRUE, all.y = TRUE)
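A quick way to see why the exact-key merge falls short, sketched on a three-row subset of the data above (the subset is an assumption made to keep the example short): no ProfName value is shared between the two frames, so the full outer merge can only stack the rows side by side.

```r
# Toy subset; the values are copied from the dput() output in the question
data1 <- data.frame(ProfName = c("Alan Attaway", "Lisa M. Blum"),
                    Title    = c(" PhD", " LLM"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(ProfName = c("Attaway, A     ", "Blum, L     ", "Okafor, A     "),
                    stringsAsFactors = FALSE)

# No ProfName value appears in both frames
shared <- intersect(data1$ProfName, data2$ProfName)
length(shared)   # 0

# all = TRUE is shorthand for all.x = TRUE, all.y = TRUE
merged <- merge(data1, data2, by = "ProfName", all = TRUE)
nrow(merged)     # 5: the 2 + 3 input rows, none of them paired
```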


Here is a simple solution:

library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)

data1 %<>% mutate(lname = str_extract(ProfName, "[A-Za-z\\-]+$"))
data2 %<>% mutate(lname = str_extract(ProfName, "^[A-Za-z\\-]+"))

df <- merge(data1, data2, all.y = TRUE, by = "lname")

head(df)

#          lname           ProfName.x Title                            Profession           ProfName.y
# 1      Attaway         Alan Attaway   PhD                             Professor      Attaway, A     
# 2 Barr-Pulliam  Dereck Barr-Pulliam   PhD                   Assistant Professor Barr-Pulliam, D     
# 3    Blandford                 <NA>  <NA>                                  <NA>    Blandford, K     
# 4         Blum         Lisa M. Blum   LLM                            Instructor         Blum, L     
# 5     Callahan Carolyn M.  Callahan   PhD Brown-Forman Professor of Accountancy     Callahan, C     
# 6       Foster   Benjamin P. Foster   PhD                             Professor       Foster, B 
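The merged result above still lacks the ProfNameNew column the question asks for. A minimal base R sketch of that last step follows, on a small subset of the data. The column name ProfNameShort and the helpers last_run and first_run are hypothetical names introduced here for illustration; they are not part of the original answer.

```r
# Toy subset of the question's data; data2's name column is renamed to
# ProfNameShort (a hypothetical name) so the two name columns stay distinct
data1 <- data.frame(
  ProfName   = c("Alan Attaway", "Dereck Barr-Pulliam", "Hua (Christine) Xin"),
  Title      = c(" PhD", " PhD", " PhD"),
  Profession = c("Professor", "Assistant Professor", "Assistant Professor"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  ProfNameShort = c("Blandford, K     ", "Attaway, A     ",
                    "Xin, H     ", "Barr-Pulliam, D     "),
  stringsAsFactors = FALSE
)

# Hypothetical helpers: trailing / leading run of letters and hyphens.
# regmatches() drops non-matching elements, so every name must match.
last_run  <- function(x) regmatches(x, regexpr("[A-Za-z-]+$", x))
first_run <- function(x) regmatches(x, regexpr("^[A-Za-z-]+", x))

data1$lname <- last_run(trimws(data1$ProfName))
data2$lname <- first_run(data2$ProfNameShort)

# Keep every row of data2; unmatched rows get NA in the data1 columns
out <- merge(data1, data2, by = "lname", all.y = TRUE)

# Fall back to the trimmed short name when there is no full name
out$ProfNameNew <- ifelse(is.na(out$ProfName),
                          trimws(out$ProfNameShort),
                          out$ProfName)
out[, c("ProfNameNew", "Title", "Profession")]
```

Rows without a match, such as Blandford, keep NA in Title and Profession and take their name from the short form.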

Does this work:

> library(dplyr)
> df %>% mutate(secName = trimws(gsub('(.*)\\s(.*)$', '\\2', ProfName))) %>% 
+   right_join(df1 %>% mutate(secName = trimws(gsub('(.*)(, .)', '\\1',ProfName))) %>% rename(new = ProfName)) %>% 
+   mutate(ProfName = coalesce(ProfName, new)) %>% 
+   select(-secName)
Joining, by = "secName"
               ProfName Title                            Profession                  new
1   Hua (Christine) Xin   PhD                   Assistant Professor          Xin, H     
2   Dereck Barr-Pulliam   PhD                   Assistant Professor Barr-Pulliam, D     
3          Lisa M. Blum   LLM                            Instructor         Blum, L     
4   Russell  Williamson   PhD                   Assistant Professor   Williamson, R     
5      William D. Stout   PhD      Associate Professor and Director        Stout, W     
6       Michael F. Wade   CPA                            Instructor         Wade, M     
7   Sheila A.  Johnston    MS                            Instructor     Johnston, S     
8           Julie Huang   PhD                   Associate Professor        Huang, Y     
9          Alan Attaway   PhD                             Professor      Attaway, A     
10         Alan Levitan   PhD                             Professor      Levitan, A     
11   Benjamin P. Foster   PhD                             Professor       Foster, B     
12 Carolyn M.  Callahan   PhD Brown-Forman Professor of Accountancy     Callahan, C     
13    Blandford, K       <NA>                                  <NA>    Blandford, K     
14       Okafor, A       <NA>                                  <NA>       Okafor, A     
15        Rolen, R       <NA>                                  <NA>        Rolen, R     
> 

Data used:

> df
               ProfName Title                            Profession
8   Hua (Christine) Xin   PhD                   Assistant Professor
18  Dereck Barr-Pulliam   PhD                   Assistant Professor
25         Lisa M. Blum   LLM                            Instructor
36  Russell  Williamson   PhD                   Assistant Professor
49     William D. Stout   PhD      Associate Professor and Director
50      Michael F. Wade   CPA                            Instructor
56  Sheila A.  Johnston    MS                            Instructor
69          Julie Huang   PhD                   Associate Professor
71         Alan Attaway   PhD                             Professor
82         Alan Levitan   PhD                             Professor
88   Benjamin P. Foster   PhD                             Professor
89 Carolyn M.  Callahan   PhD Brown-Forman Professor of Accountancy
> df1
               ProfName
1     Blandford, K     
2        Okafor, A     
3      Johnston, S     
4         Rolen, R     
5       Attaway, A     
6           Xin, H     
7         Huang, Y     
8         Stout, W     
9    Williamson, R     
10     Callahan, C     
11       Foster, B     
12         Blum, L     
13      Levitan, A     
14 Barr-Pulliam, D     
15         Wade, M     
> 
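Both answers join on the last name alone, which pairs the wrong rows if two people share a surname. The first initial is not a safe tiebreaker for this data, since "Julie Huang" corresponds to "Huang, Y". A small guard, sketched here with a hypothetical helper keys_are_unique, can confirm the keys are unique before joining:

```r
# Hypothetical helper: TRUE when no key value occurs more than once
keys_are_unique <- function(keys) !anyDuplicated(keys)

# Short-form names as in the question's second data frame
short_names <- c("Blandford, K     ", "Okafor, A     ", "Johnston, S     ",
                 "Rolen, R     ", "Attaway, A     ", "Xin, H     ",
                 "Huang, Y     ", "Stout, W     ", "Williamson, R     ",
                 "Callahan, C     ", "Foster, B     ", "Blum, L     ",
                 "Levitan, A     ", "Barr-Pulliam, D     ", "Wade, M     ")

# Last name = everything before the first comma
lname <- sub(",.*$", "", short_names)

keys_are_unique(lname)             # TRUE for this data

# An invented extra row with a repeated surname makes the check fail
keys_are_unique(c(lname, "Wade"))  # FALSE
```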
