
R: split string by multi-character delimiter and keep the delimiter

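I am trying to parse a fairly complex string in R. This requires splitting the string on multi-character delimiters while keeping the parts of the delimiter on either side of the split.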

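In words:

  • I have one long string made up of multiple entries. Each entry begins with a number of varying length, followed by "\t".
  • Each entry contains multiple paragraphs, which I also want to split. A paragraph ending follows this pattern: character, period, character (no whitespace).
  • I want to split the entries, keeping the entry number at the start of each entry.
  • I want to split the paragraphs, keeping the period at the end of the preceding paragraph.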
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# desired output
[[1]] "1\tThis is a sentence. This is still part of the first paragraph."
[[2]] "This is now the second paragraph."
[[3]] "10\tThis is sentence number 1 of the tenth entry."
[[4]] "This is the second sentence now. Still the second paragraph."

I found some answers here, but I could not extend them to multi-character delimiters.

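Here is one possible approach.
In a regex, \w is a word character: it matches a letter, a digit, or an underscore. (\\w\\.)(\\w) searches for a "." between two word characters, and the parentheses divide the match into two groups that can be referenced. "\\1###\\2" is the replacement pattern, where \1 and \2 refer to the regex groups of the match. It therefore inserts a dummy delimiter at each place where a split should occur. We can then split on ### without removing any of the original content.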

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
# The `_` placeholder for the native pipe requires R >= 4.2.0
input |> gsub("(\\w\\.)(\\w)", "\\1###\\2", x = _) |>
         strsplit("###", fixed = TRUE) |> unlist()
#> [1] "1\tThis is a sentence. This is still part of the first paragraph."
#> [2] "This is now the second paragraph."                                
#> [3] "10\tThis is sentence number 1 of the tenth entry."                
#> [4] "This is the second sentence now. Still the second paragraph."

Created on 2023-01-21 with reprex v2.0.2
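The dummy-delimiter idea above can be wrapped in a small function. This is only a sketch: the name `split_paragraphs` and the `marker` argument are hypothetical and not part of the original answer. It assumes the marker never occurs in the text and stops with an error if it does.

```r
# Hypothetical helper (not from the original answer): mark each paragraph
# boundary with a placeholder, then split on that placeholder.
split_paragraphs <- function(x, marker = "###") {
  # Assumption: the marker must not already occur in the text,
  # otherwise the split would also cut at those places.
  stopifnot(!any(grepl(marker, x, fixed = TRUE)))
  replacement <- paste0("\\1", marker, "\\2")
  marked <- gsub("(\\w\\.)(\\w)", replacement, x)
  unlist(strsplit(marked, marker, fixed = TRUE))
}

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
pieces <- split_paragraphs(input)
length(pieces)
#> [1] 4
```

One caveat: the pattern consumes the character after the period, so two boundaries separated by a single character (as in "a.b.c") would only be split once. The sample input has no such case.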

Using strsplit, but with a lookbehind on a capture group.

strsplit(input, '(?<=(\\.(?=\\w)))', perl=TRUE) |> unlist()
# [1] "1\tThis is a sentence. This is still part of the first paragraph."
# [2] "This is now the second paragraph."                                
# [3] "10\tThis is sentence number 1 of the tenth entry."                
# [4] "This is the second sentence now. Still the second paragraph." 

Data:

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
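Because a lookbehind is zero-width, the split should not consume any characters. A quick way to check this, sketched here with a hypothetical variable name `pieces`, is to paste the result back together and compare it with the input, and to confirm that the entry numbers still lead their pieces.

```r
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# Split after every period that is directly followed by a word character.
pieces <- unlist(strsplit(input, "(?<=(\\.(?=\\w)))", perl = TRUE))

# Gluing the pieces back together should reproduce the input exactly,
# i.e. no delimiter characters were dropped.
no_loss <- identical(paste(pieces, collapse = ""), input)

# Pieces that begin with an entry number followed by a tab.
entry_starts <- grepl("^\\d+\\t", pieces, perl = TRUE)

no_loss
pieces[entry_starts]
```

If `no_loss` is `TRUE`, both the periods and the entry numbers were kept, which is what the question asks for.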
