How to merge two data frames with specific string match in columns in R?
I have two data frames, data1 and data2, with the following contents:
dput(data1)
structure(list(ProfName = c("Hua (Christine) Xin", "Dereck Barr-Pulliam",
"Lisa M. Blum", "Russell Williamson", "William D. Stout", "Michael F. Wade",
"Sheila A. Johnston", "Julie Huang", "Alan Attaway", "Alan Levitan",
"Benjamin P. Foster", "Carolyn M. Callahan"), Title = c(" PhD",
" PhD", " LLM", " PhD", " PhD", " CPA", " MS", " PhD", " PhD",
" PhD", " PhD", " PhD"), Profession = c("Assistant Professor",
"Assistant Professor", "Instructor", "Assistant Professor", "Associate Professor and Director",
"Instructor", "Instructor", "Associate Professor", "Professor",
"Professor", "Professor", "Brown-Forman Professor of Accountancy"
)), row.names = c(8L, 18L, 25L, 36L, 49L, 50L, 56L, 69L, 71L,
82L, 88L, 89L), class = "data.frame")
It looks like this:
dput(data2)
structure(list(ProfName = c("Blandford, K ", "Okafor, A ",
"Johnston, S ", "Rolen, R ", "Attaway, A ", "Xin, H ",
"Huang, Y ", "Stout, W ", "Williamson, R ", "Callahan, C ",
"Foster, B ", "Blum, L ", "Levitan, A ", "Barr-Pulliam, D ",
"Wade, M ")), row.names = c(NA, -15L), class = "data.frame")
data2 looks like this:
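I want to merge the two data frames, but the names are written differently in each. Only part of the string (the surname) matches between the ProfName column of data1 and the ProfName column of data2. The rows should be merged on that match, and where a name has no information in data1 the fields should be left empty. For rows with nothing in the Title and Profession columns, the ProfName and New columns should hold the same name.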
I tried using merge, but it did not give the desired output.
merge(data1, data2, by="ProfName", all.x=TRUE, all.y = TRUE)
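A quick check suggests why the plain merge cannot work: the two ProfName columns share no identical strings, so a full join only stacks the rows side by side with NA. The sketch below uses three rows from each data frame; the names d1 and d2 are cut-down stand-ins introduced here for illustration, not objects from the question.

```r
# Cut-down copies of data1 and data2 (hypothetical names d1, d2).
d1 <- data.frame(
  ProfName   = c("Hua (Christine) Xin", "Lisa M. Blum", "Alan Attaway"),
  Title      = c(" PhD", " LLM", " PhD"),
  Profession = c("Assistant Professor", "Instructor", "Professor"),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  ProfName = c("Blandford, K ", "Xin, H ", "Attaway, A "),
  stringsAsFactors = FALSE
)

# No string appears in both columns, so there is nothing to join on.
common <- intersect(d1$ProfName, d2$ProfName)
print(common)      # character(0)

# A full outer join therefore keeps all rows from both sides, unmatched.
full <- merge(d1, d2, by = "ProfName", all.x = TRUE, all.y = TRUE)
print(nrow(full))  # 6
```

Any working approach therefore has to derive a shared key, such as the surname, before joining.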
The output should look like this:
Here is a simple solution:
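library(stringr)
library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe
# The surname is the last word in data1 and the first word in data2.
data1 %<>% mutate(lname = str_extract(ProfName, "[A-Za-z\\-]+$"))
data2 %<>% mutate(lname = str_extract(ProfName, "^[A-Za-z\\-]+"))
# all.y = TRUE keeps every row of data2, matched or not.
df <- merge(data1, data2, all.y = TRUE, by = "lname")
head(df)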
#          lname          ProfName.x Title                            Profession      ProfName.y
# 1      Attaway        Alan Attaway   PhD                             Professor      Attaway, A
# 2 Barr-Pulliam Dereck Barr-Pulliam   PhD                   Assistant Professor Barr-Pulliam, D
# 3    Blandford                <NA>  <NA>                                  <NA>    Blandford, K
# 4         Blum        Lisa M. Blum   LLM                            Instructor         Blum, L
# 5     Callahan Carolyn M. Callahan   PhD Brown-Forman Professor of Accountancy     Callahan, C
# 6       Foster  Benjamin P. Foster   PhD                             Professor       Foster, B
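The join key above relies on the surname being the last word in data1 and the first word in data2. As a sketch under that assumption, the same extraction can be written in base R, trimming whitespace first so that a trailing space (which would stop a pattern anchored with `$` from matching) does no harm. The helper names last_word and first_word are illustrative, not part of the answer, and a name ending in a suffix such as "Jr." would need extra handling.

```r
# Hypothetical helpers: surname as last word, or as the text before the comma.
last_word  <- function(x) sub("^.*\\s", "", trimws(x))
first_word <- function(x) sub(",.*$", "", trimws(x))

# Full names as in data1 (one with a deliberate trailing space).
k1 <- last_word(c("Hua (Christine) Xin", "Dereck Barr-Pulliam "))
# "Surname, Initial " strings as in data2.
k2 <- first_word(c("Xin, H ", "Barr-Pulliam, D "))

print(k1)  # "Xin" "Barr-Pulliam"
print(k2)  # "Xin" "Barr-Pulliam"
```

Because both helpers return the same strings for matching people, either column can serve as the `by` key in merge.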
Does this work:
> library(dplyr)
> df %>% mutate(secName = trimws(gsub('(.*)\\s(.*)$', '\\2', ProfName))) %>%
+ right_join(df1 %>% mutate(secName = trimws(gsub('(.*)(, .)', '\\1',ProfName))) %>% rename(new = ProfName)) %>%
+ mutate(ProfName = coalesce(ProfName, new)) %>%
+ select(-secName)
Joining, by = "secName"
              ProfName Title                            Profession             new
 1 Hua (Christine) Xin   PhD                   Assistant Professor          Xin, H
 2 Dereck Barr-Pulliam   PhD                   Assistant Professor Barr-Pulliam, D
 3        Lisa M. Blum   LLM                            Instructor         Blum, L
 4  Russell Williamson   PhD                   Assistant Professor   Williamson, R
 5    William D. Stout   PhD      Associate Professor and Director        Stout, W
 6     Michael F. Wade   CPA                            Instructor         Wade, M
 7  Sheila A. Johnston    MS                            Instructor     Johnston, S
 8         Julie Huang   PhD                   Associate Professor        Huang, Y
 9        Alan Attaway   PhD                             Professor      Attaway, A
10        Alan Levitan   PhD                             Professor      Levitan, A
11  Benjamin P. Foster   PhD                             Professor       Foster, B
12 Carolyn M. Callahan   PhD Brown-Forman Professor of Accountancy     Callahan, C
13        Blandford, K  <NA>                                  <NA>    Blandford, K
14           Okafor, A  <NA>                                  <NA>       Okafor, A
15            Rolen, R  <NA>                                  <NA>        Rolen, R
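For readers without dplyr, roughly the same result can be sketched in base R: build the surname key on both sides, merge with all.y = TRUE, then fall back to the data2 name wherever data1 had no match. The cut-down d1 and d2, the key column, and the exact column name New are assumptions made for illustration. Duplicated surnames on either side would multiply rows in the join, so the sketch checks for them first.

```r
# Cut-down stand-ins for the two data frames (hypothetical names d1, d2).
d1 <- data.frame(
  ProfName   = c("Hua (Christine) Xin", "Lisa M. Blum", "Alan Attaway"),
  Title      = c(" PhD", " LLM", " PhD"),
  Profession = c("Assistant Professor", "Instructor", "Professor"),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  ProfName = c("Blandford, K ", "Xin, H ", "Attaway, A "),
  stringsAsFactors = FALSE
)

# Surname key: last word in d1, text before the comma in d2.
d1$key <- sub("^.*\\s", "", trimws(d1$ProfName))
d2$key <- sub(",.*$", "", trimws(d2$ProfName))

# A surname occurring twice on one side would make the join ambiguous.
stopifnot(!anyDuplicated(d1$key), !anyDuplicated(d2$key))

# Keep every d2 row; rename its name column to New before merging.
names(d2)[names(d2) == "ProfName"] <- "New"
out <- merge(d1, d2, by = "key", all.y = TRUE)

# Where d1 had no match, reuse the d2 name so ProfName and New agree.
no_match <- is.na(out$ProfName)
out$ProfName[no_match] <- out$New[no_match]
out <- out[, c("ProfName", "Title", "Profession", "New")]
print(out)
```

Unmatched rows keep NA in Title and Profession, which mirrors the output shown above.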
Data used:
> df
              ProfName Title                            Profession
 8 Hua (Christine) Xin   PhD                   Assistant Professor
18 Dereck Barr-Pulliam   PhD                   Assistant Professor
25        Lisa M. Blum   LLM                            Instructor
36  Russell Williamson   PhD                   Assistant Professor
49    William D. Stout   PhD      Associate Professor and Director
50     Michael F. Wade   CPA                            Instructor
56  Sheila A. Johnston    MS                            Instructor
69         Julie Huang   PhD                   Associate Professor
71        Alan Attaway   PhD                             Professor
82        Alan Levitan   PhD                             Professor
88  Benjamin P. Foster   PhD                             Professor
89 Carolyn M. Callahan   PhD Brown-Forman Professor of Accountancy
> df1
          ProfName
 1    Blandford, K
 2       Okafor, A
 3     Johnston, S
 4        Rolen, R
 5      Attaway, A
 6          Xin, H
 7        Huang, Y
 8        Stout, W
 9   Williamson, R
10     Callahan, C
11       Foster, B
12         Blum, L
13      Levitan, A
14 Barr-Pulliam, D
15         Wade, M