R: Merge data frames based on substring match

I have two data frames that I would like to merge by protein accession names.

df1 is a data frame of protein accession names associated with a gene (there can be several names per gene). Each row of df1 therefore contains a "list" of these names, stored as a single string separated by semicolons. Every name is unique and never occurs again in df1. I have written these names as "A1, B1, ..." below:

df1:

Name                a.value
A1;B1;C1            ...
A2                  ...
A3;B3               ...
A4;B4;C4;D4;E4;F4   ...

df2 is a data frame containing only one of these accession names per row:

df2:

Name  b.value
A2    ...
B3    ...
B4    ...

Both df1 and df2 contain other columns.

I would like to merge the data frames so that rows are matched if the accession name in df2 exists as one of the names in df1, as follows:

A2   A2                 a.value  b.value
B3   A3;B3              ...      ...
B4   A4;B4;C4;D4;E4;F4  ...      ...

And of course, other columns from both data frames should be included.

Any suggestions are greatly appreciated; let me know if you need me to elaborate on anything.

Thanks!

This gives the requested output:

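# Split each semicolon-separated string into its individual accession names
l <- strsplit(as.character(df1$Name), ';')
# One row per accession name, keeping the original string alongside it
df1new <- data.frame(Name = unlist(l), Name.string = rep(df1$Name, lengths(l)))
# Left join: keep every row of df2 and attach the matching df1 string
merge(df2, df1new, by = 'Name', all.x = TRUE)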

The result:

   Name       Name.string
1:   A2                A2
2:   B3             A3;B3
3:   B4 A4;B4;C4;D4;E4;F4
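
The snippet above only carries Name.string over from df1, while the question also asks for the other columns. A minimal base-R sketch of one way to keep them follows. The numeric a.value and b.value entries are made-up placeholders (the question only shows "..."), and df1long and merged are names chosen here for illustration:

```r
# Hypothetical sample data: the numeric values are placeholders, not from the question.
df1 <- data.frame(
  Name = c("A1;B1;C1", "A2", "A3;B3", "A4;B4;C4;D4;E4;F4"),
  a.value = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Name = c("A2", "B3", "B4"),
  b.value = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Split each semicolon-separated string into its individual accession names.
l <- strsplit(as.character(df1$Name), ";", fixed = TRUE)

# Repeat every df1 row once per accession name it contains,
# so that all df1 columns travel along with each name.
df1long <- df1[rep(seq_len(nrow(df1)), lengths(l)), , drop = FALSE]

# Keep the original string as Name.string and use the single name as the key.
names(df1long)[names(df1long) == "Name"] <- "Name.string"
df1long$Name <- unlist(l)

# Left join: keep every df2 row and attach the matching df1 row where one exists.
merged <- merge(df2, df1long, by = "Name", all.x = TRUE)
print(merged)
```

Because each accession name occurs only once in df1, every df2 row matches at most one expanded row, so the merge should not duplicate df2 rows. With all.x = TRUE, a df2 name that is missing from df1 is kept with NA in the df1 columns.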
