Create new columns based on the content of strings of another column
I have the following data:
gene_Id <- c( 'No_id' , 'P1_1_EXN' , 'P1_2_EXN' ,
'P1_1_EXN_O' , 'P1_2_EXN_O' ,
'P2_1_EXN' , 'P2_2_EXN' ,
'P2_1_EXN_O' , 'P2_2_EXN_O' ,
'P1nM1' , 'P2nM1')
Count_F <- c(rep('KL',5),rep('KD',6))
DF <- data.frame(gene_Id , Count_F)
I would like to create three additional columns:
first_one should replace the cells that contain the pattern '_Number_' with 'gene_Number', for example replace P1_1_EXN with gene_1, with the possibility to control the value given to the remaining strings that don't match this criterion.
second_one should hold the rest of the string after the pattern '_Number_', for example only EXN in the previous example.
third_one should replace any cell that contains 'P Number' with 'PREP_Number', for example replace P1_1_EXN with PREP_1.
EDIT: this is the expected output.
PRER <- c ( 'No_P' ,rep('PREP_1' , 4) , rep('PREP_2' , 4) , 'PREP_1' , 'PREP_2')
Gene_Num <- c ('No_num' , 'gene_1' , 'gene_2' , 'gene_1' , 'gene_2' ,'gene_1',
'gene_2', 'gene_1', 'gene_2' , 'NEG' , 'NEG')
Rest <-c('No_rest','EXN','EXN','EXN_O','EXN_O','EXN','EXN','EXN_O','EXN_O', 'Neg','Neg')
New_DF <- cbind(DF,Gene_Num,Rest,PRER)
Thanks a lot in advance.

Here is one possibility using the dplyr package and case_when.
library(dplyr)

DF %>%
  # each column: extract when the pattern is present, otherwise use a placeholder
  mutate(col1 = case_when(grepl("_\\d_", gene_Id) ~ gsub(".*_(\\d)_.*", "gene_\\1", gene_Id),
                          TRUE ~ "dummy1"),
         col2 = case_when(grepl("_\\d_", gene_Id) ~ gsub("^.*_\\d_", "", gene_Id),
                          TRUE ~ "dummy2"),
         col3 = case_when(grepl("P\\d", gene_Id) ~ gsub(".*P(\\d).*", "PREP_\\1", gene_Id),
                          TRUE ~ "dummy3"))

      gene_Id Count_F   col1   col2   col3
1       No_id      KL dummy1 dummy2 dummy3
2    P1_1_EXN      KL gene_1    EXN PREP_1
3    P1_2_EXN      KL gene_2    EXN PREP_1
4  P1_1_EXN_O      KL gene_1  EXN_O PREP_1
5  P1_2_EXN_O      KL gene_2  EXN_O PREP_1
6    P2_1_EXN      KD gene_1    EXN PREP_2
7    P2_2_EXN      KD gene_2    EXN PREP_2
8  P2_1_EXN_O      KD gene_1  EXN_O PREP_2
9  P2_2_EXN_O      KD gene_2  EXN_O PREP_2
10      P1nM1      KD dummy1 dummy2 PREP_1
11      P2nM1      KD dummy1 dummy2 PREP_2
Here is a little explanation: first I check whether the desired substring is contained in gene_Id using grepl. If yes, I extract it according to the rules. If not, I assign a dummy value (I named those dummy1, dummy2 and dummy3).
I use regular expressions to match the strings: \\d matches a digit and _\\d_ matches a digit between two underscores. When using gsub, \\1 refers to whatever was matched in the first pair of parentheses: in this case it is always a digit.
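The pieces described above can be tried in isolation on a few strings. This is only a minimal base-R sketch: the three sample IDs are taken from the question's data, and ifelse stands in for case_when so that no package is needed.

```r
# Three sample IDs taken from the question's data
ids <- c("No_id", "P1_1_EXN_O", "P2nM1")

# grepl() reports whether each string contains a digit between two underscores
has_num <- grepl("_\\d_", ids)

# gsub() replaces the whole string with "gene_" plus the captured digit;
# strings that do not match the pattern are returned unchanged
gene <- gsub(".*_(\\d)_.*", "gene_\\1", ids)

# Keep the replacement where the pattern was found,
# otherwise fall back to a placeholder ("dummy1", as in the answer)
result <- ifelse(has_num, gene, "dummy1")
print(result)
```

Because gsub leaves non-matching strings untouched, the grepl check is what lets the placeholder be chosen freely.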
So for example the definition of col1 works like this:
Check whether the pattern _\\d_ is found inside gene_Id: if yes, replace the whole string with gene_\\1, where \\1 is the digit between the underscores.
If the pattern _\\d_ is not found, assign "dummy1".

An alternative using dplyr and stringr:
library(dplyr)
library(stringr)

DF %>%
  mutate(Gene = str_c("gene", str_extract(gene_Id, "_\\d(?=_)")),
         Rest = str_extract(gene_Id, "(?<=_\\d_).*"),
         P_Number = str_replace(str_extract(gene_Id, "P\\d"), "P", "PREP_"))
returns

      gene_Id Count_F   Gene  Rest P_Number
1       No_id      KL   <NA>  <NA>     <NA>
2    P1_1_EXN      KL gene_1   EXN   PREP_1
3    P1_2_EXN      KL gene_2   EXN   PREP_1
4  P1_1_EXN_O      KL gene_1 EXN_O   PREP_1
5  P1_2_EXN_O      KL gene_2 EXN_O   PREP_1
6    P2_1_EXN      KD gene_1   EXN   PREP_2
7    P2_2_EXN      KD gene_2   EXN   PREP_2
8  P2_1_EXN_O      KD gene_1 EXN_O   PREP_2
9  P2_2_EXN_O      KD gene_2 EXN_O   PREP_2
10      P1nM1      KD   <NA>  <NA>   PREP_1
11      P2nM1      KD   <NA>  <NA>   PREP_2
I didn't include any handling for the <NA> cases.
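One way the <NA> cases might be filled is to wrap each expression in dplyr's coalesce, which returns the first non-missing value. The sketch below is an illustration rather than part of the original answer: the fallback labels "No_num", "No_rest" and "No_P" are borrowed from the question's expected output and applied to every non-matching row, and the three-row data frame is a shortened stand-in for DF.

```r
library(dplyr)
library(stringr)

# Shortened stand-in for DF (assumption: three IDs from the question)
DF_small <- data.frame(gene_Id = c("No_id", "P1_2_EXN_O", "P2nM1"),
                       stringsAsFactors = FALSE)

out <- DF_small %>%
  # coalesce() swaps each NA for the fallback label given as second argument
  mutate(Gene = coalesce(str_c("gene", str_extract(gene_Id, "_\\d(?=_)")), "No_num"),
         Rest = coalesce(str_extract(gene_Id, "(?<=_\\d_).*"), "No_rest"),
         P_Number = coalesce(str_replace(str_extract(gene_Id, "P\\d"), "P", "PREP_"),
                             "No_P"))
print(out)
```

If different rows need different labels (the expected output uses NEG for the nM IDs), a case_when with one branch per rule is probably the better fit.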