简体   繁体   中英

Time series analysis with R statistical software

I have a problem when analysing a dataset.

786650417, 3197478917, 264665794, **2016-12-01T00:00:00Z**, 1, 4, 94717580778

Above is one column in my data set.I want to extract the date and time (which is highlighted) and create a new column with that.

I have tried the following code:

sms <- sms %>%
  mutate(date =str_extract_all(data , "[0-9]+\\-"))

From this code, I got the date which is separated with (-), but not correctly.

sms <- sms %>%
  mutate(time =str_extract_all(data , "[0-9]+\\:"))

From this, I got the time.

Then, I removed the inverted commas using the below R code.

del <- colwise(function(sms) str_replace_all(sms, '\"', ""))
sms <- del(sms)

So, finally, it looked like this.

786650417, 3197478917, 264665794, 2016-12-01T00:00:00Z, 1, 4, 94717580778

I made two columns from the above row. Those two rows are below.

date                    time
c(2016-, 12-)       c(00:, 00:)

But now, I cannot perform any forecasting methods since the date and time are not in the correct format.

Please help me to extract the date and time as the right type from this record.

786650417, 3197478917, 264665794, 2016-12-01T00:00:00Z, 1, 4, 94717580778

In my dataset there are 2 million records like this.

I think you can assign the two columns you want to extract to a variable:

A <- nameOfYourSet$date 

B <- nameOfYOurSet$time

Use $ to precise which label (which columns in your case) you want to consider.

Then try to use the cbind() function to do some concatenation to get what you want:

cbind(A,B) will "paste" B on the right side of A .

Finally, use as .***() functions to get the right format you need in order to perform forecasting (I don't know what format you need).

Let's try this -

library(stringr)
library(chron)
df <- data.frame(raw_data=c("786650417,3197478917,264665794,2016-12-01T00:00:00Z,1,4,94717580778",
                            "786650417,3197478917,264665794,2016-12-01T00:02:17Z,1,4,94717580778",
                            "786650417,3197478917,264665794,2016-12-01T00:28:10Z,1,4,94717580778"))
df$date <- as.Date(str_extract(df$raw_data, "\\d{4}-\\d{2}-\\d{2}"))
df$time <- chron(times=str_extract(df$raw_data, "\\d{2}:\\d{2}:\\d{2}"))
df$datetime <- strptime(paste(str_extract(df$raw_data, "\\d{4}-\\d{2}-\\d{2}"), str_extract(df$raw_data, "\\d{2}:\\d{2}:\\d{2}"), sep = " "), format="%Y-%m-%d %H:%M:%S")
df

note: Time can be formatted as per the real timezone of collected data- you just need to give one more parameter to strptime() eg tz = "GMT" if timestamp is in GMT format or else by default your local system's timezone will be selected (as in this case it's IST).


Don't forget to let us know if it solved your problem by marking it as the right answer :)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM