Cut a long column of characters between two commas in an R dataframe

I'm working in R and would like to reduce a column so that it keeps only the text between the 3rd and 4th commas.

Col1 <- c("Sample1")
Col2 <- c("1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X")

df <- data.frame(Col1, Col2)
Col1     Col2
Sample1  1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,f:Raphid-pennate,g:Raphid-pennate_X

From this table, I would like to get:

Col1     Col2
Sample1  Bacillariophyta

My dataset is really big. Does anyone know how I can do this?

You can use sapply with strsplit to split each string on commas and extract the 4th element.

df$Col3 <- sapply(df$Col2, function(x)unlist(strsplit(x, ","))[4])

df

#      Col1
# 1 Sample1
#   Col2
# 1 1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X
#                Col3
# 1 c:Bacillariophyta
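
Since the dataset is described as large, a slightly stricter variant of the same idea may be worth considering. The sketch below is not part of the original answer: it assumes Col2 is a character column, splits on a literal comma with fixed = TRUE (no regex engine involved), and uses vapply so that the result is guaranteed to be a character vector. Rows with fewer than four comma-separated fields would come back as NA.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps Col2 as character)
df <- data.frame(
  Col1 = "Sample1",
  Col2 = "1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X",
  stringsAsFactors = FALSE
)

# Split every string once on a literal comma; the result is a list of character vectors
parts <- strsplit(df$Col2, ",", fixed = TRUE)

# Take the 4th field of each row; indexing past the end yields NA rather than an error
df$Col3 <- vapply(parts, function(p) p[4], character(1))

df$Col3
```

Calling strsplit once on the whole column avoids invoking it separately for every row, which is the main difference from the sapply version above.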

An alternative is to use sub with a regular expression that skips the first three comma-terminated fields and captures the fourth:

df$Col2 <- sub("^(?:[^,]+,){3}([^,]+).*", "\\1", df$Col2)

# Col1              Col2
# 1 Sample1 c:Bacillariophyta
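
Both approaches return c:Bacillariophyta, whereas the desired output in the question is Bacillariophyta without the rank prefix. As a minimal sketch, assuming every field has the form rank:name, a second sub call can drop everything up to and including the first colon. The variable names fourth and taxon are illustrative only, and perl = TRUE is added here simply to make the regex engine explicit.

```r
# The example string from the question
x <- "1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X"

# Step 1: keep only the text between the 3rd and 4th commas
fourth <- sub("^(?:[^,]+,){3}([^,]+).*", "\\1", x, perl = TRUE)

# Step 2: remove the rank prefix (everything up to and including the first colon)
taxon <- sub("^[^:]*:", "", fourth)

taxon
```

Note that sub returns its input unchanged when the pattern does not match, so a string with fewer than four fields would pass through step 1 untouched; checking for such rows beforehand may be sensible on real data.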
