
Cut a long column of characters between two commas in an R dataframe

I'm working in R and would like to trim a column so that it keeps only the text between the 3rd and 4th commas.

Col1 <- c("Sample1")
Col2 <- c("1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X")

df <- data.frame(Col1, Col2)
Col1 Col2
Sample1 1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X

With this table, I would like to have:

Col1 Col2
Sample1 Bacillariophyta

My dataset is really big. Does anyone know how I can do this?

You can use sapply with strsplit to split each string on commas and keep the 4th element.

df$Col3 <- sapply(df$Col2, function(x)unlist(strsplit(x, ","))[4])

df

#      Col1
# 1 Sample1
#   Col2
# 1 1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X
#                Col3
# 1 c:Bacillariophyta
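
Note that the result above keeps the "c:" rank prefix, while the question asks for just "Bacillariophyta". The following is a minimal sketch of one way to drop it, assuming every taxonomy field has the form rank:name. The helper name extract_field is hypothetical (not part of the original answer), and vapply is swapped in for sapply so the result is always a plain character vector.

```r
# Hypothetical helper: return the n-th comma-separated field of each string,
# with a leading "rank:" prefix (e.g. "c:") removed.
extract_field <- function(x, n = 4) {
  # as.character() guards against factor columns (the default in R < 4.0)
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  # Strings with fewer than n fields yield NA instead of an error
  field <- vapply(parts, function(p) p[n], character(1))
  # Assumption: the prefix is everything up to and including the first colon
  sub("^[^:]*:", "", field)
}

df <- data.frame(
  Col1 = "Sample1",
  Col2 = "1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X",
  stringsAsFactors = FALSE
)

df$Col3 <- extract_field(df$Col2)
df$Col3
# [1] "Bacillariophyta"
```

Because strsplit and sub are both vectorized, this runs over the whole column in one call, which may matter for a large dataset.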

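An alternative would be to use sub, with a pattern that skips the first three comma-terminated fields and captures the fourth. Note that this overwrites Col2: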

df$Col2 <- sub("^(?:[^,]+,){3}([^,]+).*", "\\1", df$Col2)

# Col1              Col2
# 1 Sample1 c:Bacillariophyta
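
If the rank prefix ("c:") should be dropped as well, as in the output requested in the question, the pattern can be extended to skip everything up to the colon inside the 4th field. This is only a sketch, assuming the 4th field always has the form rank:name; the variable names tax and pattern are illustrative, and perl = TRUE is set explicitly so the non-capturing group is handled by the PCRE engine.

```r
tax <- "1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X"

# (?:[^,]*,){3}  skip the first three comma-terminated fields
# [^:,]*:        skip the rank prefix of the 4th field (e.g. "c:")
# ([^,]*)        capture the name up to the next comma
pattern <- "^(?:[^,]*,){3}[^:,]*:([^,]*).*$"

result <- sub(pattern, "\\1", tax, perl = TRUE)
result
# [1] "Bacillariophyta"
```

One caveat: sub returns the input unchanged when the pattern does not match, so strings with fewer than four fields would pass through as-is rather than becoming NA.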

