
Equivalent of this string replacement code in R?

I have the following code that works really well to remove characters from the end of elements in a Python list:

x = ['01/01/2013 00:00:00','01/01/2013 00:00:00',
    '01/01/2013 00:00:00','01/01/2013 00:00:00',...]

Given that list, I want to remove the 00:00:00 part, so I wrote this:

i = 0
while i < len(x):
    x[i] = x[i][:x[i].find(' 00:00:00')]
    i += 1

This does the trick. How can I implement a similar solution in R? I've tried substr and gsub, but they run really slowly (the actual list has over 250,000 date/time combinations).

Try

x <- rep('01/01/2013 00:00:00', 250000)
system.time(y <- sub(" 00:00:00", "", x, fixed=TRUE))
#    user  system elapsed 
#    0.05    0.00    0.05 

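y contains the result. The timing shows that the whole replacement takes about 0.05 seconds for 250,000 elements. fixed=TRUE tells sub to treat the pattern as a literal string rather than a regular expression; see ?sub for help on the other parameters.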
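As a small illustration of how the call behaves on mixed input, here is a minimal sketch. The vector demo is made up for this example and is not from the question. With fixed=TRUE the pattern is matched literally, and elements that do not contain it come back unchanged.

```r
# Hypothetical three-element vector, used only for illustration
demo <- c("01/01/2013 00:00:00", "01/01/2013 12:34:56", "02/01/2013 00:00:00")

# fixed = TRUE: " 00:00:00" is treated as a literal string, not a regular expression
res <- sub(" 00:00:00", "", demo, fixed = TRUE)

# res now holds "01/01/2013", "01/01/2013 12:34:56", "02/01/2013"
print(res)
```

Note that a fixed pattern is not anchored to the end of the string, so this assumes " 00:00:00" can only occur as the trailing time part.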

Consider some sample data:

set.seed(144)
dat <- sample(c("01/01/2013 00:00:00", "01/01/2013 12:34:56"), 200000, replace=TRUE)
table(dat)
# dat
# 01/01/2013 00:00:00 01/01/2013 12:34:56 
#              100100               99900 

Here, we want to remove the trailing 00:00:00 but keep the trailing 12:34:56.

You could first find 00:00:00 at the end of the string with the following (runs in ~0.1 seconds on my computer):

to.clean <- grepl(" 00:00:00$", dat)

Now you can use substr to remove the relevant trailing characters (runs in ~0.04 seconds on my computer):

dat[to.clean] <- substr(dat[to.clean], 1, nchar(dat[to.clean])-9)
table(dat)
# dat
#          01/01/2013 01/01/2013 12:34:56 
#              100100               99900 

Alternatively, the following more compact gsub command also runs in about 0.15 seconds for these 200,000 date/time pairs:

cleaned <- gsub(" 00:00:00$", "", dat)
table(cleaned)
# cleaned
#          01/01/2013 01/01/2013 12:34:56 
#              100100               99900 

It's possible that you were looping through the data and separately calling substr or gsub on each individual element of your vector, which would certainly be expected to be much slower since it doesn't take advantage of vectorization.
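To make that contrast concrete, here is a minimal sketch of the two styles. The element-by-element loop is only a guess at what the slow version might have looked like, since the question does not show its R code, and x_small is a made-up vector. Timings vary by machine, so none are quoted here.

```r
# Hypothetical data: a smaller vector so the loop finishes quickly
x_small <- rep(c("01/01/2013 00:00:00", "01/01/2013 12:34:56"), 5000)

# Element-by-element: one sub() call per string (assumed slow pattern)
looped <- character(length(x_small))
loop_time <- system.time(
  for (i in seq_along(x_small)) {
    looped[i] <- sub(" 00:00:00$", "", x_small[i])
  }
)

# Vectorized: a single sub() call over the whole vector
vec_time <- system.time(vectorized <- sub(" 00:00:00$", "", x_small))

# Both approaches give the same result; only the running time differs
stopifnot(identical(looped, vectorized))
cat("loop:", loop_time[["elapsed"]], "s; vectorized:", vec_time[["elapsed"]], "s\n")
```

Because the pattern is anchored with $, at most one match per string is possible, so sub and gsub behave identically here.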
