
Convert Speech Start and End Time into Time Series

I am looking to convert the following R data frame into one that is indexed by time in 0.1-second steps, and I have no idea how to do it. Maybe dcast, but then I'm confused about how to expand out the word that's being spoken. The first table below is what I have; the second is what I want.

startTime endTime           word
1     1.900s  2.300s         hey
2     2.300s  2.800s         I'm
3     2.800s      3s        John
4         3s  3.400s       right
5     3.400s  3.500s         now
6     3.500s  3.800s           I
7     3.800s  4.300s        help

Time           word
1.900s         hey
2.000s         hey
2.100s         hey
2.200s         hey
2.300s         I'm
2.400s         I'm
2.500s         I'm
2.600s         I'm
2.700s         I'm
2.800s         John
2.900s         John
3.000s         right
3.100s         right
3.200s         right
3.300s         right

One solution uses tidyr::expand.

EDITED: Based on feedback from the OP, whose data contains duplicate startTime values, each row is now expanded on its own by grouping on the row number.

library(tidyverse)
step = 0.1
df %>% group_by(rnum = row_number()) %>%
  expand(Time = seq(startTime, max(startTime, (endTime-step)), by=step), word = word) %>%
  arrange(Time) %>% 
  ungroup() %>%
  select(-rnum)

# # A tibble: 24 x 2
# # Groups: word [7]
#    Time word 
#   <dbl> <chr>
# 1  1.90 hey  
# 2  2.00 hey  
# 3  2.10 hey  
# 4  2.20 hey  
# 5  2.30 I'm  
# 6  2.40 I'm  
# 7  2.50 I'm  
# 8  2.60 I'm  
# 9  2.70 I'm  
# 10  2.80 John
# ... with 14 more rows
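To see what the seq() call does for a single row, here is a minimal base R sketch. The helper name expand_word is hypothetical and not part of the answer above; it only isolates the per-row logic. The max(startTime, endTime - step) guard matters for words shorter than one step: without it, seq() would be asked to count downwards with a positive step and would stop with an error.

```r
# Hypothetical helper (not from the original answer): timestamps for one word
expand_word <- function(start, end, step = 0.1) {
  # max() guard: a word shorter than one step still yields its start time
  seq(start, max(start, end - step), by = step)
}

hey_times   <- expand_word(1.9, 2.3)   # four timestamps, end time excluded
now_times   <- expand_word(3.4, 3.5)   # word lasting exactly one step
short_times <- expand_word(3.4, 3.45)  # word shorter than one step

# Without the guard, the same short word makes seq() fail
unguarded <- try(seq(3.4, 3.45 - 0.1, by = 0.1), silent = TRUE)
```

Both now_times and short_times contain only the start time, so very short words are kept rather than dropped.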

Data

df <- read.table(text = 
"startTime endTime           word
     1.900  2.300         hey
     2.300  2.800         I'm
     2.800      3        John
     3      3.400       right
     3.400  3.500         now
     3.500  3.800           I
     3.800  4.300        help",
header = TRUE, stringsAsFactors = FALSE)

dcast() reshapes data from long to wide format (aggregating along the way), whereas the OP wants to reshape from wide to long format and fill in the missing timestamps.

There is an alternative approach which uses a non-equi join.

Prepare data

Before we can proceed, startTime and endTime need to be turned into numeric variables by removing the trailing "s".

library(data.table)
cols <- stringr::str_subset(names(DF), "Time$")
setDT(DF)[, (cols) := lapply(.SD, function(x) as.numeric(stringr::str_replace(x, "s", ""))), 
          .SDcols = cols]
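The same conversion can be sketched in base R on a plain character vector, which may help to check the result before touching the whole table. The helper name strip_seconds is hypothetical; the pattern "s$" is anchored to the end of the string so that only the trailing unit is removed.

```r
# Hypothetical helper: drop a trailing "s" and convert to numeric
strip_seconds <- function(x) {
  as.numeric(sub("s$", "", x))
}

raw_times <- c("1.900s", "2.300s", "3s", "3.400s")
num_times <- strip_seconds(raw_times)
```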

Non-equi join

A sequence of timestamps covering the whole period is created and right-joined to the dataset, but only those timestamps which fall within one of the given intervals are retained. From the accepted answer, it seems that endTime must not be included in the result, so the join condition has to be adjusted accordingly.

DF[DF[, CJ(time = seq(min(startTime), max(endTime), 0.1))], 
   on = .(startTime <= time, endTime > time), nomatch = 0L][
     , endTime := NULL][]   # a bit of clean-up
    startTime  word
 1:       1.9   hey
 2:       2.0   hey
 3:       2.1   hey
 4:       2.2   hey
 5:       2.3   I'm
 6:       2.4   I'm
 7:       2.5   I'm
 8:       2.6   I'm
 9:       2.7   I'm
10:       2.8  John
11:       2.9  John
12:       3.0 right
13:       3.1 right
14:       3.2 right
15:       3.3 right
16:       3.4   now
17:       3.5     I
18:       3.6     I
19:       3.7     I
20:       3.8  help
21:       3.9  help
22:       4.0  help
23:       4.1  help
24:       4.2  help
    startTime  word

Note that this approach does not require introducing row numbers.

nomatch = 0L avoids NA rows in case of gaps in the dialogue.
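The effect of nomatch can be illustrated with a small sketch. The data below is made up for this illustration (it is not the OP's data) and uses whole seconds with a step of 1 to keep the numbers simple: nothing is spoken between second 3 and second 4, and the last timestamp equals the final endTime, which is excluded by the join condition.

```r
library(data.table)

# Hypothetical data with a pause in the dialogue (not from the question)
gap_dt <- data.table(startTime = c(1, 4), endTime = c(3, 6),
                     word = c("hi", "there"))
ticks <- data.table(time = seq(1, 6, by = 1))

# Default behaviour: timestamps without a matching interval are kept, word is NA
with_na <- gap_dt[ticks, on = .(startTime <= time, endTime > time)]

# nomatch = 0L: timestamps without a matching interval are dropped
no_na <- gap_dt[ticks, on = .(startTime <= time, endTime > time), nomatch = 0L]
```

Here with_na keeps all six timestamps, two of them with a missing word, while no_na keeps only the four timestamps during which a word is spoken.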

Data

library(data.table)
DF <- fread("
rn startTime endTime           word
1     1.900s  2.300s         hey
2     2.300s  2.800s         I'm
3     2.800s      3s        John
4         3s  3.400s       right
5     3.400s  3.500s         now
6     3.500s  3.800s           I
7     3.800s  4.300s        help
", drop = 1L)
