I have the following data:
gene_Id <- c( 'No_id' , 'P1_1_EXN' , 'P1_2_EXN' ,
'P1_1_EXN_O' , 'P1_2_EXN_O' ,
'P2_1_EXN' , 'P2_2_EXN' ,
'P2_1_EXN_O' , 'P2_2_EXN_O' ,
'P1nM1' , 'P2nM1')
Count_F <- c(rep('KL',5),rep('KD',6))
DF <- data.frame(gene_Id , Count_F)
I would like to create three additional columns. first_one should replace the cells that contain the pattern '_Number_' with 'gene_Number', for example replace P1_1_EXN with gene_1, with the possibility to control the name given to the strings that don't match this criterion. second_one should hold the rest of the string after the pattern '_Number_', for example only EXN in the previous example. third_one should replace any cell that contains 'P Number' with 'PREP_Number', for example replace P1_1_EXN with PREP_1.
EDIT: this is the expected output.
PRER <- c ( 'No_P' ,rep('PREP_1' , 4) , rep('PREP_2' , 4) , 'PREP_1' , 'PREP_2')
Gene_Num <- c ('No_num' , 'gene_1' , 'gene_2' , 'gene_1' , 'gene_2' ,'gene_1',
'gene_2', 'gene_1', 'gene_2' , 'NEG' , 'NEG')
Rest <-c('No_rest','EXN','EXN','EXN_O','EXN_O','EXN','EXN','EXN_O','EXN_O', 'Neg','Neg')
New_DF <- cbind(DF,Gene_Num,Rest,PRER)
Thanks a lot in advance.
Here is one possibility using the dplyr package and case_when.
library(dplyr)

DF %>%
  # each column: extract on a match, otherwise fall back to a dummy value
  mutate(col1 = case_when(grepl("_\\d_", gene_Id) ~ gsub(".*_(\\d)_.*", "gene_\\1", gene_Id),
                          TRUE ~ "dummy1"),
         col2 = case_when(grepl("_\\d_", gene_Id) ~ gsub("^.*_\\d_", "", gene_Id),
                          TRUE ~ "dummy2"),
         col3 = case_when(grepl("P\\d", gene_Id) ~ gsub(".*P(\\d).*", "PREP_\\1", gene_Id),
                          TRUE ~ "dummy3"))

      gene_Id Count_F   col1   col2   col3
1       No_id      KL dummy1 dummy2 dummy3
2    P1_1_EXN      KL gene_1    EXN PREP_1
3    P1_2_EXN      KL gene_2    EXN PREP_1
4  P1_1_EXN_O      KL gene_1  EXN_O PREP_1
5  P1_2_EXN_O      KL gene_2  EXN_O PREP_1
6    P2_1_EXN      KD gene_1    EXN PREP_2
7    P2_2_EXN      KD gene_2    EXN PREP_2
8  P2_1_EXN_O      KD gene_1  EXN_O PREP_2
9  P2_2_EXN_O      KD gene_2  EXN_O PREP_2
10      P1nM1      KD dummy1 dummy2 PREP_1
11      P2nM1      KD dummy1 dummy2 PREP_2
Here is a little explanation: first I check whether the desired substring is contained in gene_Id using grepl. If yes, I extract it according to the rules. If not, I assign a dummy value (I named those dummy1, dummy2 and dummy3).
I use regular expressions to match the strings: \\d matches a digit and _\\d_ matches a digit between two underscores. When using gsub, \\1 refers to whatever was matched in the first parenthesis: in this case it is always a digit.
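To see the two building blocks in isolation, here is a minimal sketch on a small hypothetical vector x (not part of the original data frame). It uses only base R and shows why the grepl check is needed: gsub returns a string unchanged when the pattern does not match.

```r
# Hypothetical sample values taken from the gene_Id column
x <- c("P1_1_EXN", "P2_2_EXN_O", "P1nM1")

# Does the string contain a digit between two underscores?
has_num <- grepl("_\\d_", x)

# Replace the whole string with "gene_" plus the captured digit.
# Strings without a match are returned unchanged by gsub().
gene <- gsub(".*_(\\d)_.*", "gene_\\1", x)

print(has_num)
print(gene)
```

The last element stays "P1nM1" because nothing matched, which is why case_when is used to assign a fallback value instead.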
So for example the definition of col1 works like this:
Check for _\\d_ inside gene_Id: if yes, replace the whole string with gene_\\1, where \\1 is the digit between the underscores.
If there is no _\\d_, assign "dummy1".

An alternative using dplyr and stringr:
library(dplyr)
library(stringr)

DF %>%
  mutate(Gene = str_c("gene", str_extract(gene_Id, "_\\d(?=_)")),
         Rest = str_extract(gene_Id, "(?<=_\\d_).*"),
         P_Number = str_replace(str_extract(gene_Id, "P\\d"), "P", "PREP_"))
returns
      gene_Id Count_F   Gene  Rest P_Number
1       No_id      KL   <NA>  <NA>     <NA>
2    P1_1_EXN      KL gene_1   EXN   PREP_1
3    P1_2_EXN      KL gene_2   EXN   PREP_1
4  P1_1_EXN_O      KL gene_1 EXN_O   PREP_1
5  P1_2_EXN_O      KL gene_2 EXN_O   PREP_1
6    P2_1_EXN      KD gene_1   EXN   PREP_2
7    P2_2_EXN      KD gene_2   EXN   PREP_2
8  P2_1_EXN_O      KD gene_1 EXN_O   PREP_2
9  P2_2_EXN_O      KD gene_2 EXN_O   PREP_2
10      P1nM1      KD   <NA>  <NA>   PREP_1
11      P2nM1      KD   <NA>  <NA>   PREP_2
I didn't include handling for the <NA> cases.
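One way to fill the missing values, sketched here as an assumption rather than as part of the original answer, is dplyr's coalesce, which keeps the first non-missing value. The default labels "No_num", "No_rest" and "No_P" are borrowed from the expected output in the question, and the result name res is only for illustration.

```r
library(dplyr)
library(stringr)

# Same data as in the question
gene_Id <- c("No_id", "P1_1_EXN", "P1_2_EXN", "P1_1_EXN_O", "P1_2_EXN_O",
             "P2_1_EXN", "P2_2_EXN", "P2_1_EXN_O", "P2_2_EXN_O",
             "P1nM1", "P2nM1")
DF <- data.frame(gene_Id, Count_F = c(rep("KL", 5), rep("KD", 6)))

res <- DF %>%
  mutate(Gene     = str_c("gene", str_extract(gene_Id, "_\\d(?=_)")),
         Rest     = str_extract(gene_Id, "(?<=_\\d_).*"),
         P_Number = str_replace(str_extract(gene_Id, "P\\d"), "P", "PREP_")) %>%
  # coalesce() keeps the extracted value and falls back to the default on NA
  mutate(Gene     = coalesce(Gene, "No_num"),
         Rest     = coalesce(Rest, "No_rest"),
         P_Number = coalesce(P_Number, "No_P"))

print(res)
```

This uses a single default per column. The expected output in the question distinguishes two kinds of non-matching rows (for example 'No_num' versus 'NEG'), which would need an extra condition, for instance a case_when that also checks whether P_Number is missing.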