R: Split a string on a multi-character delimiter and keep the delimiter

Question

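I am trying to parse a fairly complex string in R. This requires splitting the string on a multi-character pattern while keeping the parts of the delimiter on either side of the split.

In words:

I have a long string made up of multiple entries. Each entry starts with a number of varying length, followed by "\t".
Each entry contains multiple paragraphs, which I also want to split. Paragraph endings follow this pattern: character, period, character (no space).
I want to split at each entry, keeping the entry number at the start of the entry.
I want to split at each paragraph, keeping the period at the end of the preceding paragraph.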

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# desired output
[[1]] "1\tThis is a sentence. This is still part of the first paragraph."
[[2]] "This is now the second paragraph."
[[3]] "10\tThis is sentence number 1 of the tenth entry."
[[4]] "This is the second sentence now. Still the second paragraph."

I have found some answers here, but I have not been able to extend them to multi-character delimiters.

Answer 1

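Here is one possible approach.
In regex, \w is a word character: it matches a letter, a digit, or an underscore. (\\w\\.)(\\w) searches for a "." between two word characters, and the parentheses divide the match into two groups that can be referenced later. "\\1###\\2" is the replacement pattern, where \1 and \2 refer to the groups captured by that match. The replacement therefore inserts a dummy delimiter at each place where a split should occur. We can then split on ### without removing any of the original content.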

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
input |> gsub("(\\w\\.)(\\w)", "\\1###\\2", x = _) |> 
         strsplit("###", fixed = TRUE) |> unlist()
#> [1] "1\tThis is a sentence. This is still part of the first paragraph."
#> [2] "This is now the second paragraph."                                
#> [3] "10\tThis is sentence number 1 of the tenth entry."                
#> [4] "This is the second sentence now. Still the second paragraph."

Created on 2023-01-21 with reprex v2.0.2
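
The claim that nothing is lost can be checked directly. The following is a minimal sketch, not part of the original answer: it assumes the placeholder "###" never occurs in the input (guarded with stopifnot), and the names marked and parts are only illustrative. Note that the pattern would also split inside something like a decimal number ("3.14"), since that is also a period between two word characters.

```r
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# Assumption: the placeholder must not already appear in the data,
# otherwise the split would also cut at those positions.
stopifnot(!grepl("###", input, fixed = TRUE))

# Insert the placeholder between "<word char>." and the next word char,
# then split on the placeholder as a literal string.
marked <- gsub("(\\w\\.)(\\w)", "\\1###\\2", input)
parts  <- unlist(strsplit(marked, "###", fixed = TRUE))

# Gluing the pieces back together should reproduce the input exactly.
identical(paste(parts, collapse = ""), input)
#> [1] TRUE
length(parts)
#> [1] 4
```

If the data could contain "###", a different placeholder that is known to be absent would have to be chosen.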

Answer 2

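Use strsplit, but with a lookbehind wrapped around a capture group. The pattern matches the zero-width position just after a "." that is immediately followed by a word character, so nothing is consumed by the split. Lookarounds require perl=TRUE.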

strsplit(input, '(?<=(\\.(?=\\w)))', perl=TRUE) |> unlist()
# [1] "1\tThis is a sentence. This is still part of the first paragraph."
# [2] "This is now the second paragraph."                                
# [3] "10\tThis is sentence number 1 of the tenth entry."                
# [4] "This is the second sentence now. Still the second paragraph."

Data:

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
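
One difference between the two answers may be worth noting: the lookbehind pattern above only looks at the period and the character after it, while Answer 1 also requires a word character before the period. The sketch below compares the pattern from Answer 2 with a stricter variant; p_strict, p_after_dot, and the test string x are hypothetical names and data introduced here for illustration, not part of either answer.

```r
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# Pattern from Answer 2: split after any "." that is followed by a word character.
p_after_dot <- "(?<=(\\.(?=\\w)))"
# Hypothetical stricter variant: also require a word character before the ".",
# which mirrors the (\\w\\.)(\\w) condition used in Answer 1.
p_strict <- "(?<=\\w\\.)(?=\\w)"

a <- unlist(strsplit(input, p_after_dot, perl = TRUE))
b <- unlist(strsplit(input, p_strict, perl = TRUE))
identical(a, b)  # both patterns give the same four pieces on this input

# They differ when a non-word character, here ")", precedes the period.
x <- "See (above).Next paragraph."
strsplit(x, p_after_dot, perl = TRUE)[[1]]  # split into two pieces after ")."
strsplit(x, p_strict, perl = TRUE)[[1]]     # left as a single piece
```

Which behaviour is preferable depends on the data; for the input in the question the two are interchangeable.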
