How do I gsub the complete time string behind @
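(This is my first question, so please let me know if I need to improve it!)
I am analysing a large observational data set. The start and stop time of each observation is recorded, so I can calculate the duration. However, there is a comment column that contains information about "pause"/"break" or "out of sight" periods during which the animal was not seen. I would like to subtract these periods from the total duration.
My problem is that the column holds several kinds of comments: not only pauses ("HH:MM-HH:MM") but also information about certain events (xy happened "@HH:MM").
I only want to look at the periods formatted as HH:MM-HH:MM, and I want to exclude all event times marked as "@HH:MM". I managed to remove all the words so that only the numbers remain, so it looks like this: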
id <- c("3990", "3989", "3004")
timepoints <- c("@6:19,,7:16-7:23,7:25-7:43,@7:53,", "@6:19,,7:25-7:43,@7:53", "7:30-7:39,7:45-7:48,7:49-7:54")
df <- data.frame(id, timepoints)
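I tried several approaches with grep and gsub, trying to specify either what to keep or what to ignore, but I failed. The closest I got was R removing "@HH" but keeping ":MM". For that I used
gsub("@([[:digit:]]|[_])*", "", df$timepoints)
, as found in a similar question about words: Remove all words that start with "@" from a string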
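The goal is to obtain (for example):
id | timepoints |
---|---|
3990 | "7:16-7:23, 7:25-7:43" |
or
id | timepoints |
---|---|
3990 | "7:16-7:23", "7:25-7:43" |
If possible, separated by commas, or split directly into different columns, so that I can extract the times and subtract them from my total observation time.
Any help would be greatly appreciated!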
You can do it like this:
f <- function(x) {
lapply(x, \(s) {
s = strsplit(s,",")[[1]]
s[grepl("^\\d",s)]
})
}
Then apply the function to the timepoints column:
library(tidyverse)
mutate(df %>% as_tibble(), timepoints = f(timepoints)) %>%
unnest(timepoints)
Output:
id timepoints
<chr> <chr>
1 3990 7:16-7:23
2 3990 7:25-7:43
3 3989 7:25-7:43
4 3004 7:30-7:39
5 3004 7:45-7:48
6 3004 7:49-7:54
You can also use unnest_wider() to get these as columns; for that, I would adjust my f() to add names to the time points:
f <- function(x) {
lapply(x, \(s) {
s = strsplit(s,",")[[1]]
s = s[grepl("^\\d",s)]
setNames(s, paste0("tp", seq_along(s)))  # seq_along() is safe when s is empty; 1:length(s) would give c(1, 0)
})
}
library(tidyverse)
mutate(df %>% as_tibble(), timepoints = f(timepoints)) %>%
unnest_wider(timepoints)
Output:
id tp1 tp2 tp3
<chr> <chr> <chr> <chr>
1 3990 7:16-7:23 7:25-7:43 NA
2 3989 7:25-7:43 NA NA
3 3004 7:30-7:39 7:45-7:48 7:49-7:54
How about matching the strings you are interested in instead?
With base R:
df$new_timepoints <- regmatches(df$timepoints, gregexpr("\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}", df$timepoints))
Output (with a list column):
id timepoints new_timepoints
1 3990 @6:19,,7:16-7:23,7:25-7:43,@7:53, 7:16-7:23, 7:25-7:43
2 3989 @6:19,,7:25-7:43,@7:53 7:25-7:43
3 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39, 7:45-7:48, 7:49-7:54
With tidyverse (in long format, which is easier for calculations):
library(stringr)
library(dplyr)
library(tidyr)
df |>
group_by(id) |>
mutate(new_timepoints = str_extract_all(timepoints, "\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}")) |>
unnest_longer(new_timepoints) |>
ungroup()
Output:
# A tibble: 6 × 3
id timepoints new_timepoints
<chr> <chr> <chr>
1 3990 @6:19,,7:16-7:23,7:25-7:43,@7:53, 7:16-7:23
2 3990 @6:19,,7:16-7:23,7:25-7:43,@7:53, 7:25-7:43
3 3989 @6:19,,7:25-7:43,@7:53 7:25-7:43
4 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39
5 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:45-7:48
6 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:49-7:54
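The opposite approach, removing the "@" events with gsub as the question originally attempted, can be sketched as follows. The pattern in the question, "@([[:digit:]]|[_])*", only allows digits and underscores after the "@", so the match stops at the colon and ":MM" is left behind. This is a minimal sketch that assumes every event is written as "@H:MM" or "@HH:MM"; the names no_events and cleaned are made up for illustration.

```r
timepoints <- c("@6:19,,7:16-7:23,7:25-7:43,@7:53,",
                "@6:19,,7:25-7:43,@7:53",
                "7:30-7:39,7:45-7:48,7:49-7:54")

# Remove each "@H:MM" / "@HH:MM" event as a whole, including the ":MM" part
no_events <- gsub("@[[:digit:]]{1,2}:[[:digit:]]{2}", "", timepoints)

# Collapse runs of commas left behind, then trim a comma at either end
cleaned <- gsub(",{2,}", ",", no_events)
cleaned <- gsub("^,|,$", "", cleaned)
cleaned
```

Each element of cleaned is then a single comma-separated string of "HH:MM-HH:MM" periods, which can be split with strsplit() if separate values are needed.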
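1) list. Split on commas, then grep out the components that contain a dash. No packages are used. This gives a list of character vectors as the timepoints column.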
df2 <- df
df2$timepoints <- lapply(strsplit(df$timepoints, ","),
grep, pattern = "-", value = TRUE)
df2
## id timepoints
## 1 3990 7:16-7:23, 7:25-7:43
## 2 3989 7:25-7:43
## 3 3004 7:30-7:39, 7:45-7:48, 7:49-7:54
str(df2)
## 'data.frame': 3 obs. of 2 variables:
## $ id : chr "3990" "3989" "3004"
## $ timepoints:List of 3
## ..$ : chr "7:16-7:23" "7:25-7:43"
## ..$ : chr "7:25-7:43"
## ..$ : chr "7:30-7:39" "7:45-7:48" "7:49-7:54"
2) character. If you want a single comma-separated character string in each row:
transform(df2, timepoints = sapply(timepoints, paste, collapse = ","))
## id timepoints
## 1 3990 7:16-7:23,7:25-7:43
## 2 3989 7:25-7:43
## 3 3004 7:30-7:39,7:45-7:48,7:49-7:54
3) long form. Or, if you prefer long form, use this:
long <- with(df2, stack(setNames(timepoints, id))[2:1])
names(long) <- names(df2)
long
## id timepoints
## 1 3990 7:16-7:23
## 2 3990 7:25-7:43
## 3 3989 7:25-7:43
## 4 3004 7:30-7:39
## 5 3004 7:45-7:48
## 6 3004 7:49-7:54
4) wide form. Or a wide-form matrix:
nr <- nrow(long)
L <- transform(long, seq = ave(1:nr, id, FUN = seq_along))
tapply(L$timepoints, L[c("id", "seq")], c)
## seq
## id 1 2 3
## 3990 "7:16-7:23" "7:25-7:43" NA
## 3989 "7:25-7:43" NA NA
## 3004 "7:30-7:39" "7:45-7:48" "7:49-7:54"
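Since the stated goal is to subtract the pauses from the total observation time, the long form is probably the easiest starting point. The following is only a sketch in base R: it rebuilds the long data frame by hand, assumes every period starts and ends on the same day (no interval crossing midnight), and uses the made-up names to_minutes, bounds, pause_min and pause_total.

```r
# Long-format data as produced in 3), rebuilt here so the sketch is self-contained
long <- data.frame(
  id = c("3990", "3990", "3989", "3004", "3004", "3004"),
  timepoints = c("7:16-7:23", "7:25-7:43", "7:25-7:43",
                 "7:30-7:39", "7:45-7:48", "7:49-7:54")
)

# Convert "H:MM" strings to minutes since midnight
to_minutes <- function(hm) {
  parts <- strsplit(hm, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Split each period into start and end, then take the difference in minutes
bounds <- do.call(rbind, strsplit(long$timepoints, "-", fixed = TRUE))
long$pause_min <- to_minutes(bounds[, 2]) - to_minutes(bounds[, 1])

# Total pause time per id, ready to subtract from the observation duration
pause_total <- aggregate(pause_min ~ id, data = long, FUN = sum)
pause_total
```

The per-id totals in pause_total can then be merged with the observation durations by id and subtracted.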
Using the data.table package
Setting up the data
library(data.table)
id <- c("3990", "3989", "3004")
timepoints <- c("@6:19,,7:16-7:23,7:25-7:43,@7:53,", "@6:19,,7:25-7:43,@7:53", "7:30-7:39,7:45-7:48,7:49-7:54")
df <- data.table(id, timepoints)
Note that I saved it as a data.table.
Split timepoints on commas and store the values in the new_time column.
df[,new_time:=strsplit(timepoints, ",")]
Remove the string values that contain @
df[,new_time:=lapply(new_time, function(x) x[!grepl("@", x, fixed = TRUE)])]  # lapply always returns a list; sapply may simplify to a matrix
Because the timepoints column has consecutive commas in some rows, empty strings ("") are present, so I remove them
df[,new_time:=lapply(new_time, function(x) x[!stringi::stri_isempty(x)])]  # lapply always returns a list; sapply may simplify to a matrix
Now the new_time column looks like this
df$new_time
[[1]]
[1] "7:16-7:23" "7:25-7:43"
[[2]]
[1] "7:25-7:43"
[[3]]
[1] "7:30-7:39" "7:45-7:48" "7:49-7:54"
If you want the new_time column to contain a single string per row
df[,new_time:=sapply(new_time, paste, collapse=", ")]
df$new_time
[1] "7:16-7:23, 7:25-7:43" "7:25-7:43" "7:30-7:39, 7:45-7:48, 7:49-7:54"
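The split, "@" filter and empty-string filter can also be combined into one pass with base functions only, which avoids the stringi dependency. This is a sketch rather than part of the answer above; parts and new_time are plain variables here, and inside a data.table the same expression could presumably be assigned with := as in the earlier steps.

```r
timepoints <- c("@6:19,,7:16-7:23,7:25-7:43,@7:53,",
                "@6:19,,7:25-7:43,@7:53",
                "7:30-7:39,7:45-7:48,7:49-7:54")

# Split each string on commas
parts <- strsplit(timepoints, ",", fixed = TRUE)

# Keep pieces that are non-empty (nzchar) and do not contain "@"
new_time <- lapply(parts, function(x) x[nzchar(x) & !grepl("@", x, fixed = TRUE)])
new_time
```

Using lapply() rather than sapply() keeps the result a list even when every row happens to contain the same number of periods.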