Vectorizing a for-loop in R for creating strings with different lengths
I have created a sample R script to illustrate my question:
test.df <- data.frame(uid=c('x001','x002','x003'),
start_date=c('2015-01-02','2015-03-05','2015-07-09'),
end_date=c('2015-01-07','2015-03-07','2015-07-16'),
stringsAsFactors=FALSE)
test.df[,'start_date'] <- as.Date(test.df[,'start_date'])
test.df[,'end_date'] <- as.Date(test.df[,'end_date'])
for (loop in (1:nrow(test.df))) {
test.df[loop,'output'] <- paste(seq(test.df[loop,'start_date'],test.df[loop,'end_date'],by = 1),collapse=';')
}
I need to create strings of dates with different lengths. The only approach I can think of is a for-loop, but I have about 70K cases to process. Is there any way of speeding it up?
Thanks @akrun for the answer. I have further modified my question as below:
library(dplyr)
test.df <- data.frame(uid=c('x001','x002','x003'),
start_date=c('2015-01-02','2015-03-05','2015-07-09'),
end_date=c('2015-01-07','2015-03-07','2015-07-16'),
stringsAsFactors=FALSE)
test.df[,'start_date'] <- as.Date(test.df[,'start_date'])
test.df[,'end_date'] <- as.Date(test.df[,'end_date'])
# Part A
for (loop in (1:nrow(test.df))) {
test.df[loop,'output'] <- paste(seq(test.df[loop,'start_date'],test.df[loop,'end_date'],by = 1),collapse=';')
}
# Part B
test.mod <- group_by(test.df,uid) %>%
  do({df <- data.frame(.)
      output.df <- data.frame(uid=df[1,'uid'],
                              date=unlist(strsplit(df[,'output'],';')))
      data.frame(output.df)
  })
Now Part A is fixed, but is there any way to speed up Part B? Or should I combine Part A and Part B together? Please enlighten me, as data.table is new to me.
We could convert 'test.df' to a 'data.table' with setDT(test.df). Then, grouped by 'uid', we take the seq from 'start_date' to 'end_date' and paste the elements together.
library(data.table)
setDT(test.df)[,paste(seq(start_date, end_date, by = '1 day'), collapse=';') , uid]
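If the string should be stored as a column of 'test.df' rather than returned as a new table, one option is data.table's `:=` operator. This is only a minimal sketch: it assumes each 'uid' appears exactly once (otherwise seq would receive vectors of length greater than one and fail), and the column name 'output' is taken from the question.

```r
library(data.table)

test.df <- data.frame(uid = c('x001', 'x002', 'x003'),
                      start_date = as.Date(c('2015-01-02', '2015-03-05', '2015-07-09')),
                      end_date = as.Date(c('2015-01-07', '2015-03-07', '2015-07-16')),
                      stringsAsFactors = FALSE)

# Convert to a data.table in place (no copy is made)
setDT(test.df)

# Add 'output' by reference: one ';'-separated string of dates per uid
test.df[, output := paste(seq(start_date, end_date, by = '1 day'), collapse = ';'),
        by = uid]
```

Because `:=` modifies 'test.df' by reference, no reassignment is needed and the original columns are kept alongside 'output'.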
For Part B, if we don't paste, the result is a two-column dataset with one row per 'uid' and date:
setDT(test.df)[,seq(start_date, end_date, by = '1 day') , uid]
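By default the unnamed result column is called 'V1'. A small variation, sketched here under the same assumption that each 'uid' is unique, names it 'date' to match the output of Part B in the question. The name 'long.dt' is just an illustrative choice.

```r
library(data.table)

test.df <- data.frame(uid = c('x001', 'x002', 'x003'),
                      start_date = as.Date(c('2015-01-02', '2015-03-05', '2015-07-09')),
                      end_date = as.Date(c('2015-01-07', '2015-03-07', '2015-07-16')),
                      stringsAsFactors = FALSE)

# One row per (uid, date); wrapping the result in .() lets us name the column
long.dt <- setDT(test.df)[, .(date = seq(start_date, end_date, by = '1 day')),
                          by = uid]
```

This goes straight from the start and end dates to the long table, so the paste in Part A and the strsplit in Part B are both skipped. Note that 'date' stays a Date column here, whereas the strsplit approach yields character strings.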
Here is how you can do it with apply:
test.df <- data.frame(uid=c('x001','x002','x003'),
start_date=c('2015-01-02','2015-03-05','2015-07-09'),
end_date=c('2015-01-07','2015-03-07','2015-07-16'),
stringsAsFactors=FALSE)
test.df$output <- apply(test.df, 1, function(x) paste(seq(as.Date(x[2]), as.Date(x[3]), by = 1), collapse=';'))
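Note that apply first coerces the data frame to a character matrix, which is why as.Date is called inside the function. As a possible alternative in base R (a sketch, not part of the original answer), mapply can walk over the two date columns in parallel after they have been converted once:

```r
test.df <- data.frame(uid = c('x001', 'x002', 'x003'),
                      start_date = c('2015-01-02', '2015-03-05', '2015-07-09'),
                      end_date = c('2015-01-07', '2015-03-07', '2015-07-16'),
                      stringsAsFactors = FALSE)

# Convert the character columns to Date once, up front
test.df$start_date <- as.Date(test.df$start_date)
test.df$end_date <- as.Date(test.df$end_date)

# Iterate over start/end pairs; each call builds one ';'-separated string
test.df$output <- mapply(function(s, e) paste(seq(s, e, by = 'day'), collapse = ';'),
                         test.df$start_date, test.df$end_date)
```

Whether this is faster than apply on 70K rows would need to be measured. It still calls seq once per row, so it is a tidier loop rather than true vectorization.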