R: split string by multi-character delimiter and keep the delimiter
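I'm trying to parse a fairly complex string in R. It needs to be split on a multi-character delimiter, while keeping the parts of the delimiter that fall before and after each split point.
To illustrate: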
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
# desired output
[[1]] "1\tThis is a sentence. This is still part of the first paragraph."
[[2]] "This is now the second paragraph."
[[3]] "10\tThis is sentence number 1 of the tenth entry."
[[4]] "This is the second sentence now. Still the second paragraph."
I found some answers here, but I couldn't extend them to a multi-character delimiter.
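Here is one possible approach.
In regex, \w is a word character: it matches a letter, a digit, or an underscore. (\\w\\.)(\\w) searches for a "." that sits between two word characters, and the parentheses divide the match into two groups that can be referenced later. "\\1###\\2" is the replacement pattern, where \1 and \2 refer to the regex groups from that match. So it inserts a dummy delimiter at each place where a split should happen. We can then split on ### without removing any of the original content.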
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
# The `_` placeholder for the native pipe needs R >= 4.2.
input |> gsub("(\\w\\.)(\\w)", "\\1###\\2", x = _) |>
  strsplit("###", fixed = TRUE) |> unlist()
#> [1] "1\tThis is a sentence. This is still part of the first paragraph."
#> [2] "This is now the second paragraph."
#> [3] "10\tThis is sentence number 1 of the tenth entry."
#> [4] "This is the second sentence now. Still the second paragraph."
Created on 2023-01-21 with reprex v2.0.2
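To see why nothing from the original string is lost, it can help to run the two steps separately and inspect the intermediate result. The following is only a minimal sketch of the approach above: the names marked and parts are introduced here for illustration, and it assumes that "###" never occurs in the input itself.

```r
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# Step 1: insert the dummy delimiter between "<word char>." and the next
# word character. Both captured groups are written back, so no text is dropped.
marked <- gsub("(\\w\\.)(\\w)", "\\1###\\2", input)

# Step 2: split on the dummy delimiter, treated as a fixed string.
# Assumption: "###" does not already appear in the input.
parts <- strsplit(marked, "###", fixed = TRUE)[[1]]

# Sanity check: gluing the pieces back together reproduces the input.
roundtrip_ok <- identical(paste(parts, collapse = ""), input)
print(parts)
print(roundtrip_ok)
```

If the data might contain "###", a different marker that cannot occur in the text would have to be chosen.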
Using strsplit, but with a lookbehind on a capture group.
strsplit(input, '(?<=(\\.(?=\\w)))', perl=TRUE) |> unlist()
# [1] "1\tThis is a sentence. This is still part of the first paragraph."
# [2] "This is now the second paragraph."
# [3] "10\tThis is sentence number 1 of the tenth entry."
# [4] "This is the second sentence now. Still the second paragraph."
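The two answers can also be compared directly. The sketch below wraps each one in a small function (split_with_marker and split_with_lookbehind are hypothetical names introduced here, not part of either answer) and checks whether they agree on the sample input. Note one difference between the patterns: the gsub version requires a word character before the ".", while the lookbehind version only looks at what follows it, so the two could diverge on other inputs.

```r
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# Hypothetical helper: dummy-delimiter approach (assumes "###" is absent from x).
split_with_marker <- function(x) {
  marked <- gsub("(\\w\\.)(\\w)", "\\1###\\2", x)
  strsplit(marked, "###", fixed = TRUE)[[1]]
}

# Hypothetical helper: zero-width split after a "." that is directly
# followed by a word character, using the PCRE pattern from the answer.
split_with_lookbehind <- function(x) {
  strsplit(x, "(?<=(\\.(?=\\w)))", perl = TRUE)[[1]]
}

a <- split_with_marker(input)
b <- split_with_lookbehind(input)

# Do both approaches produce the same pieces for this input?
print(identical(a, b))
```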
Data:
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."