Merging strings and timestamps based on a condition in R

Question

I have speech transcriptions with timestamps:

df
   line speaker                                utterance                   timestamp
1  0001  ID16.1                                    ah-ha 00:00:07.060 - 00:00:07.660
3  0002    <NA>                                      yes 00:00:07.964 - 00:00:08.610
5  0003    <NA> okay so where do we know each other from 00:00:16.350 - 00:00:22.170
7  0004  ID16.2        U uh Upper Rhine Cruises? maybe?  00:00:23.400 - 00:00:26.600
9  0005  ID16.3           yeah? ((pause)) well I do n't- 00:00:26.305 - 00:00:28.210
11 0006  ID16.1                               (...) Meg? 00:00:27.385 - 00:00:29.305
13 0007    <NA>                         do you know Meg? 00:00:29.100 - 00:00:33.879

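I need to do two things: where speaker is NA, (i) merge the utterance into the previous utterance and (ii) merge the timestamps accordingly.

The desired result looks like this: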

df
   line speaker                                          utterance                   timestamp
1  0001  ID16.1 ah-ha yes okay so where do we know each other from 00:00:07.060 - 00:00:22.170
3  0004  ID16.2                  U uh Upper Rhine Cruises? maybe?  00:00:23.400 - 00:00:26.600
5  0005  ID16.3                     yeah? ((pause)) well I do n't- 00:00:26.305 - 00:00:28.210
7  0006  ID16.1                        (...) Meg? do you know Meg? 00:00:27.385 - 00:00:33.879

I have been trying to solve this with paste0, dplyr::lag and dplyr::lead, but have not got far.

Reproducible data:

df <- structure(list(
  line = c("0001", "0002", "0003", "0004", "0005", "0006", "0007"),
  speaker = c("ID16.1", NA, NA, "ID16.2", "ID16.3", "ID16.1", NA),
  utterance = c("ah-ha", "yes",
                "okay so where do we know each other from",
                "U uh Upper Rhine Cruises? maybe? ",
                "yeah? ((pause)) well I do n't-",
                "(...) Meg?", "do you know Meg?"),
  timestamp = c("00:00:07.060 - 00:00:07.660", "00:00:07.964 - 00:00:08.610",
                "00:00:16.350 - 00:00:22.170", "00:00:23.400 - 00:00:26.600",
                "00:00:26.305 - 00:00:28.210", "00:00:27.385 - 00:00:29.305",
                "00:00:29.100 - 00:00:33.879")
), row.names = c(1L, 3L, 5L, 7L, 9L, 11L, 13L), class = "data.frame")

Answer 1

Try dplyr::group_by. FYI, the data you display differs from your df, which changes the aggregation.

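library(dplyr)
df %>%
  # start a new group at every row that names a speaker; NA rows join the group above
  group_by(notna = cumsum(!is.na(speaker))) %>%
  summarize(
    line = first(line),
    speaker = first(speaker),
    utterance = paste(utterance, collapse = " "),
    # split every timestamp into start and end, keep the first start and the last end
    timestamp = paste(unlist(strsplit(timestamp, "[- ]+"))[c(1, n()*2)], collapse = " - "),
    .groups = "drop"
  ) %>%
  select(-notna)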
# # A tibble: 4 x 4
#   line  speaker utterance                                            timestamp                  
#   <chr> <chr>   <chr>                                                <chr>                      
# 1 0001  ID16.1  "ah-ha yes okay so where do we know each other from" 00:00:07.060 - 00:00:22.170
# 2 0004  ID16.2  "U uh Upper Rhine Cruises? maybe? "                  00:00:23.400 - 00:00:26.600
# 3 0005  ID16.3  "yeah? ((pause)) well I do n't-"                     00:00:26.305 - 00:00:28.210
# 4 0006  ID16.1  "(...) Meg? do you know Meg?"                        00:00:27.385 - 00:00:33.879
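
To see how the grouping key and the timestamp merge behave, it can help to look at the intermediate values on their own. The following is only an illustrative sketch in base R, not part of the original answer; the variable names grp, parts and merged are made up here, and the speaker and timestamp vectors are copied from the question's df.

```r
# Columns copied from the question's data (assumed unchanged)
speaker <- c("ID16.1", NA, NA, "ID16.2", "ID16.3", "ID16.1", NA)
timestamp <- c("00:00:07.060 - 00:00:07.660", "00:00:07.964 - 00:00:08.610",
               "00:00:16.350 - 00:00:22.170", "00:00:23.400 - 00:00:26.600",
               "00:00:26.305 - 00:00:28.210", "00:00:27.385 - 00:00:29.305",
               "00:00:29.100 - 00:00:33.879")

# The counter goes up by one at every row that names a speaker,
# so NA rows share the key of the closest named row above them.
grp <- cumsum(!is.na(speaker))
grp
# [1] 1 1 1 2 3 4 4

# Splitting on runs of spaces and hyphens yields a start and an end per row
parts <- unlist(strsplit(timestamp[grp == 1], "[- ]+"))
length(parts)
# [1] 6

# The first start and the last end form the merged span for the group
merged <- paste(parts[c(1, length(parts))], collapse = " - ")
merged
# [1] "00:00:07.060 - 00:00:22.170"
```

Note that this keys on position, so it assumes the rows are already in chronological order and that the first row names a speaker.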

Solution 1
1 accepted 2021-01-25 16:42:32