R: Split string data delimited by spaces into columns

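I have a large data frame with a single column. Each entry holds several numeric values separated by spaces, which I need to extract and organize into columns.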

<Call Begin=6.0982886400000051 End=6.1078732800000051 MaxFreq=40893.5546875 MinFreq=35400.390625 PeakFreq=39672.8515625 PeakFreqs=39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 36621.09375 36621.09375 36621.09375 36621.09375 Intensity=-14.902734633213136 Periodicity=0.853448275862069 Shape=- CallType=cf-n Species=Pipistrellus kuhlii (77%), Pipistrellus nathusii (77%) Custom=false /> 

Here is some more information about my data:

'data.frame':39 obs. of  1 variable $ x1: Factor w/ 120 levels "
<double>25.318181818181806</double>",..: 66 67 68 69 70 71 72 73 74 75...

I need something like this:

     call_begin            call_end         maxfrec         minfrec
1 0.59170816000000048 0.60006400000000049 531.005.859.375 433.349.609.375
2  0.7636582400000006 0.77135872000000061 531.005.859.375  42.724.609.375
         peakfrec
1 482.177.734.375
2 469.970.703.125

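I have some ideas for how to do this: first split the string into columns with strsplit, then extract the numbers with substr, and finally build the table with rbind. I found a few threads on related topics, but I could not reproduce them with my data.

Any help would be appreciated. Let me know if anything is unclear.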

Here is a solution similar to what you described. It is more general and does not depend on the number of columns:

text <- '<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375
<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125'

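process_line <- function(line) {
    # split on spaces and drop the leading "<Call" token
    sp <- strsplit(line, ' ')[[1]][-1]
    # the part before "=" becomes the column name
    cn <- sapply(sp, function(x) strsplit(x, "=")[[1]][1])
    # the part after "=" becomes the numeric value
    data <- sapply(sp, function(x) as.numeric(strsplit(x, "=")[[1]][2]))
    names(data) <- cn
    data
}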

t(sapply(strsplit(text, "\n")[[1]], process_line, USE.NAMES = FALSE))
         Begin       End  MaxFreq  MinFreq PeakFreq
[1,] 0.5917082 0.6000640 53100.59 43334.96 48217.77
[2,] 0.7636582 0.7713587 53100.59 42724.61 46997.07

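This assumes the text has not already been split into lines; if it has, pass the vector of lines in place of strsplit(text, "\n")[[1]]. There is no need for regular expressions, because the data can be obtained by splitting each smaller chunk on =.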
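Since the data in the question sit in a single factor column, the same idea can be applied to the data frame directly. The following is only a sketch: it assumes the column is called x1 (as in the str() output in the question) and that every row has the same fields as the shortened example. The names df, parse_call and calls are made up for illustration.

```r
# Hypothetical one-column data frame mimicking the structure in the question
df <- data.frame(
  x1 = factor(c(
    "<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375",
    "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125"
  ))
)

# Split one record into key=value tokens and return a named numeric vector
parse_call <- function(line) {
  tokens <- strsplit(line, " ", fixed = TRUE)[[1]][-1]  # drop "<Call"
  kv <- strsplit(tokens, "=", fixed = TRUE)             # list of c(key, value)
  values <- as.numeric(vapply(kv, `[`, character(1), 2))
  names(values) <- vapply(kv, `[`, character(1), 1)
  values
}

# A factor has to be converted to character before it can be split
rows <- lapply(as.character(df$x1), parse_call)

# Stack the named vectors into one row per call
calls <- as.data.frame(do.call(rbind, rows))
calls
```

Records with fields that themselves contain spaces (such as PeakFreqs or Species in the full line) would need extra handling before this works.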

gsub is my favorite.

strList = list("<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375", "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125")

dataExtract <- function(str){
  str = gsub("^<Call Begin=([0-9.]+) End=([0-9.]+) MaxFreq=([0-9.]+) MinFreq=([0-9.]+) PeakFreq=([0-9.]+)", "\\1 \\2 \\3 \\4 \\5", str)

  str = unlist(strsplit(str, " "))

  return(sapply(str, FUN=as.numeric, USE.NAMES=F))
}

#dataExtract(strList[[1]])

# iterate over strList (not str) and fill the matrix row by row, one call per row
res = matrix(unlist(lapply(strList, FUN=dataExtract)), ncol=5, byrow=TRUE)
colnames(res) = c("Call Begin", "End", "MaxFreq", "MinFreq", "PeakFreq")
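The single pattern above needs the five fields to appear one after another in a fixed order. As a variation, each field can be looked up by name, which also copes with records that carry extra fields. This is a rough sketch, not a tested solution for the real file: record is an abbreviated version of the line in the question (most PeakFreqs values are left out), and extract_field is a made-up helper.

```r
# Abbreviated record: most of the PeakFreqs values are omitted for brevity
record <- "<Call Begin=6.0982886400000051 End=6.1078732800000051 MaxFreq=40893.5546875 MinFreq=35400.390625 PeakFreq=39672.8515625 PeakFreqs=39672.8515625 39062.5 Intensity=-14.902734633213136 Shape=- CallType=cf-n />"

# Return the number that follows " <field>=" in x.
# The space before the field name and the "=" after it keep
# "PeakFreq" from matching "PeakFreqs".
extract_field <- function(x, field) {
  pattern <- paste0(".* ", field, "=(-?[0-9.]+).*")
  as.numeric(sub(pattern, "\\1", x))
}

fields <- c("Begin", "End", "MaxFreq", "MinFreq", "PeakFreq")

# sapply over a character vector names the result by field
values <- sapply(fields, function(f) extract_field(record, f))
values
```

If a field is missing, sub() returns the string unchanged and as.numeric() gives NA with a warning, so it is worth checking for NA afterwards.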

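It all depends on how strictly your data follow the pattern. For the data you provided, you can split on both " " and "=" in one pass and then pick out the relevant columns in one go.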

result <- do.call(rbind,lapply(strList,function(s) {strsplit(s,split = "[ =]")[[1]][c(3,5,7,9,11)]}))

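You can then name the columns however you like. The result is a character matrix, so use colnames() rather than names() to label its columns.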
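To get from that character matrix to the numeric table asked for in the question, something along these lines should work. It is a sketch that reuses the two sample strings; the column names are taken from the desired output in the question, and calls is just an illustrative name.

```r
strList <- list(
  "<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375",
  "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125"
)

# Split on spaces and "=" at once; the values sit at positions 3, 5, 7, 9, 11
result <- do.call(rbind, lapply(strList, function(s) {
  strsplit(s, split = "[ =]")[[1]][c(3, 5, 7, 9, 11)]
}))

# strsplit() returns text, so convert the matrix to numeric (dimensions are kept)
mode(result) <- "numeric"

# Label the columns as in the desired output, then convert to a data frame
colnames(result) <- c("call_begin", "call_end", "maxfrec", "minfrec", "peakfrec")
calls <- as.data.frame(result)
calls
```

Note that the fixed positions only hold while every record starts with the same five fields in the same order.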
