Cut a long column of characters between two commas in an R dataframe
I'm working in R, and I would like to trim a column so that it keeps only the text between the 3rd and 4th commas.
Col1<- c("Sample1")
Col2 <- c("1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X")
df <- data.frame(Col1, Col2)
| Col1 | Col2 |
|---|---|
| Sample1 | 1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X |
From this table, I would like to get:
| Col1 | Col2 |
|---|---|
| Sample1 | Bacillariophyta |
My dataset is really big; does anyone know how I can do this?
You can use sapply() together with strsplit() to split each string on commas and extract the 4th element.
df$Col3 <- sapply(df$Col2, function(x)unlist(strsplit(x, ","))[4])
df
#      Col1
# 1 Sample1
#   Col2
# 1 1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X
#                Col3
# 1 c:Bacillariophyta
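Note that the 4th field still carries its rank prefix (c:), whereas the question asks for plain Bacillariophyta. A minimal sketch of one way to drop it, assuming every 4th field has the form prefix:name; the names parts and fourth are only illustrative:

```r
# Example data, copied from the question
df <- data.frame(
  Col1 = "Sample1",
  Col2 = "1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X",
  stringsAsFactors = FALSE
)

# strsplit() is vectorised: it returns one character vector per row
parts <- strsplit(df$Col2, ",", fixed = TRUE)

# Take the 4th field of each row; rows with fewer than 4 fields give NA
fourth <- vapply(parts, function(p) p[4], character(1))

# Assumption: the field looks like "c:Name", so strip everything
# up to and including the first colon
df$Col3 <- sub("^[^:]*:", "", fourth)
```

Using vapply() with character(1) rather than sapply() guarantees a plain character vector, which can matter on a large dataset where some rows may be malformed.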
An alternative would be to use sub():
sub("^(?:[^,]+,){3}([^,]+).*", "\\1", df$Col2) -> df$Col2
# Col1 Col2
# 1 Sample1 c:Bacillariophyta
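Two caveats with this approach: the result keeps the c: prefix, and sub() returns the input unchanged when the pattern does not match (for example, when a row has fewer than four fields). A hedged variant that handles both, assuming the 4th field always has the form prefix:name; the names col2, pattern and taxon are hypothetical:

```r
# Two example strings: the one from the question, and a short one
# with too few comma-separated fields to match
col2 <- c(
  "1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X",
  "k:Eukaryota,d:Stramenopiles"
)

# Skip three comma-terminated fields, skip the rank prefix up to the
# colon, then capture the rest of the 4th field
pattern <- "^(?:[^,]*,){3}[^,:]*:([^,]*).*$"

# Return NA instead of the untouched input when the pattern fails
taxon <- ifelse(
  grepl(pattern, col2, perl = TRUE),
  sub(pattern, "\\1", col2, perl = TRUE),
  NA_character_
)
```

Both sub() and grepl() are vectorised, so this runs over a whole column without an explicit loop.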