Vectorizing a for-loop in R for creating strings with different lengths

I have created a sample R script to illustrate my question:

test.df <- data.frame(uid=c('x001','x002','x003'),
                      start_date=c('2015-01-02','2015-03-05','2015-07-09'),
                      end_date=c('2015-01-07','2015-03-07','2015-07-16'),
                      stringsAsFactors=FALSE) 
test.df[,'start_date'] <- as.Date(test.df[,'start_date']) 
test.df[,'end_date'] <- as.Date(test.df[,'end_date']) 
for (loop in (1:nrow(test.df))) {   
    test.df[loop,'output'] <- paste(seq(test.df[loop,'start_date'],test.df[loop,'end_date'],by = 1),collapse=';') 
}

I need to create date strings of different lengths, and I can only think of using a for-loop. However, I have about 70K cases to process. Is there any way to speed this up?

Update 01

Thanks @akrun for the answer. I have extended my question as follows:

library(dplyr)

test.df <- data.frame(uid=c('x001','x002','x003'),
                      start_date=c('2015-01-02','2015-03-05','2015-07-09'),
                      end_date=c('2015-01-07','2015-03-07','2015-07-16'),
                      stringsAsFactors=FALSE)
test.df[,'start_date'] <- as.Date(test.df[,'start_date'])
test.df[,'end_date'] <- as.Date(test.df[,'end_date'])

# Part A
for (loop in (1:nrow(test.df))) {   
  test.df[loop,'output'] <- paste(seq(test.df[loop,'start_date'],test.df[loop,'end_date'],by = 1),collapse=';') 
}

# Part B
test.mod <- group_by(test.df,uid) %>%
  do({df <- data.frame(.)
  output.df <- data.frame(uid=df[1,'uid'],
                          date=unlist(strsplit(df[,'output'],';')))
  data.frame(output.df)
  })

Part A is now fixed, but is there any way to speed up Part B? Or should I combine Part A and Part B? Please enlighten me, as data.table is new to me.

We can convert 'test.df' to a 'data.table' with setDT(test.df). Then, grouped by 'uid', we take the seq from 'start_date' to 'end_date' and paste the elements together.

library(data.table)
setDT(test.df)[,paste(seq(start_date, end_date, by = '1 day'), collapse=';') , uid]
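The call above returns a new table whose result column gets the default name V1. As a minimal sketch of a variation, the same expression can be assigned back as an 'output' column by reference with data.table's := operator. The name dt is only illustrative, and the sketch assumes each 'uid' appears on exactly one row (otherwise seq would receive more than one start date per group).

```r
library(data.table)

# Same sample data as in the question, with the dates already converted
dt <- data.table(
  uid        = c("x001", "x002", "x003"),
  start_date = as.Date(c("2015-01-02", "2015-03-05", "2015-07-09")),
  end_date   = as.Date(c("2015-01-07", "2015-03-07", "2015-07-16"))
)

# Add 'output' in place; assumes one row per uid
dt[, output := paste(seq(start_date, end_date, by = "1 day"), collapse = ";"),
   by = uid]
```

Because := modifies dt in place, no copy of the table is made, which may matter at around 70K rows.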

Update

For Part B, if we skip the paste step, the result is already a two-column dataset with one row per 'uid' and date, so the string from Part A does not need to be built and split again:

setDT(test.df)[,seq(start_date, end_date, by = '1 day') , uid]
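A small sketch of the same idea with the date column named explicitly, since an unnamed expression in j ends up in a column called V1. The names long and date are illustrative choices, and the sketch again assumes one row per 'uid'.

```r
library(data.table)

dt <- data.table(
  uid        = c("x001", "x002", "x003"),
  start_date = as.Date(c("2015-01-02", "2015-03-05", "2015-07-09")),
  end_date   = as.Date(c("2015-01-07", "2015-03-07", "2015-07-16"))
)

# One row per uid and day; .() is data.table's shorthand for list()
long <- dt[, .(date = seq(start_date, end_date, by = "1 day")), by = uid]
```

This replaces both Part A and Part B of the question in a single step, without going through the ';'-separated string.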

Here is how you can do it with apply:

test.df <- data.frame(uid=c('x001','x002','x003'),
                      start_date=c('2015-01-02','2015-03-05','2015-07-09'),
                      end_date=c('2015-01-07','2015-03-07','2015-07-16'),
                      stringsAsFactors=FALSE) 

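# apply() passes each row as a character vector, so the dates are converted inside the function
test.df$output <- apply(test.df, 1, function(x) paste(seq(as.Date(x[["start_date"]]), as.Date(x[["end_date"]]), by = 1), collapse=';'))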
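Note that apply first coerces the data frame to a character matrix, which is why the dates have to be converted inside the function. As a hedged alternative sketch in base R, mapply can walk the two date columns in parallel once they are already of class Date. This is still a loop under the hood rather than true vectorization, so any speed-up over the for-loop would need to be measured on the real data.

```r
test.df <- data.frame(uid = c("x001", "x002", "x003"),
                      start_date = as.Date(c("2015-01-02", "2015-03-05", "2015-07-09")),
                      end_date = as.Date(c("2015-01-07", "2015-03-07", "2015-07-16")),
                      stringsAsFactors = FALSE)

# Build one ';'-separated string per row from the paired start/end dates
test.df$output <- mapply(
  function(s, e) paste(seq(s, e, by = "day"), collapse = ";"),
  test.df$start_date, test.df$end_date,
  USE.NAMES = FALSE
)
```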
