
How to gsub on the text between two words in R?

EDIT:

I would like to place a \n before a specific unknown word in my text. I know that the first occurrence of the unknown word in my text will be between "Tree" and "Lake".

Ex. of text:

text
[1]  "TreeRULakeSunWater" 
[2]  "A B C D"

EDIT:

"Tree" and "Lake" will never change, but the word between them always changes, so I cannot hard-code "RU" in my regex.

What I am currently doing:

if (grepl(".*Tree\\s*|Lake.*",  text)) { text <- gsub(".*Tree\\s*|Lake.*", "\n\\1", text)}

The problem with the above is that gsub replaces the whole string and leaves just \nRU.

text
[1] "\nRU"

I have also tried:

if (grepl(".*Tree *(.*?) *Lake.*",  text)) { text <- gsub(".*Tree *(.*?) *Lake.*", "\n\\1", text)}

What I would like text to look like after gsub :

text
[1] "Tree \nRU LakeSunWater"
[2] "A B C D"

EDIT:

From Wiktor Stribizew's comment I am able to do a successful gsub

gsub("Tree(\\w+)Lake", "Tree \n\\1 Lake", text)

But this only replaces the occurrence where "RU" is between "Tree" and "Lake", which is the first occurrence of the unknown word. The unknown word (in this case "RU") will show up many times in the text, and I would like to place \n in front of every occurrence of "RU" when "RU" is a whole word.

New ex. of text:

text
[1] "TreeRULakeSunWater"
[2] "A B C RU D"

New ex. of what I would like:

text
[1] "Tree \nRU LakeSunWater"
[2] "A B C \nRU D"

Any help will be appreciated. Please let me know if further information is needed.

You need to find the unknown word between "Tree" and "Lake" first. You can use

unknown_word <- gsub(".*Tree(\\w+)Lake.*", "\\1", text)

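The pattern matches any characters up to the last Tree in the string, then captures the unknown word ( \\w+ = one or more word characters) up to Lake, and then matches the rest of the string, so the whole string is replaced with the captured word. gsub processes every element of the vector, and elements without a match (such as "A B C D") are returned unchanged, so take the word from the first element with the [[1]] index.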
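If you would rather not rely on the matching element being first, one possible alternative (a sketch, not part of the original answer) is to pull out the capture group with base R's regexec() and regmatches(). Elements without a match yield an empty result and can be filtered out. The variable names m and hits are only illustrative.

```r
text <- c("TreeRULakeSunWater", "A B C RU D")

# regexec() reports the full match plus each capture group per element;
# regmatches() turns that into a list, with character(0) for non-matches.
m <- regmatches(text, regexec("Tree(\\w+)Lake", text))

# Keep only the elements that matched, then take capture group 1
# (position 2, since position 1 is the full match) of the first hit.
hits <- Filter(length, m)
unknown_word <- if (length(hits) > 0) hits[[1]][2] else NA_character_

unknown_word
```

This gives a single string (or NA when "Tree...Lake" is absent), so the [[1]] index is no longer needed in the next step.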

Then, when you know the word, replace it with

gsub(paste0("[[:space:]]*(", unknown_word[[1]], ")[[:space:]]*"), " \n\\1 ", text)

See the IDEONE demo.

Here, the pattern is [[:space:]]*( + unknown_word[[1]] + )[[:space:]]*. It matches zero or more whitespace characters on both sides of the unknown word, and the unknown word itself (captured into Group 1). In the replacement, the surrounding whitespace is collapsed into a single space (or a space is added if there was none), and \\1 restores the unknown word. You may replace [[:space:]] with \\s .
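To see the whitespace handling on its own, here is a small sketch using the \\s form of the pattern. It assumes the unknown word has already been extracted as "RU", and the input strings are made up for illustration.

```r
unknown_word <- "RU"  # assumed to have been extracted already

# Same pattern as above, with \\s in place of [[:space:]].
pattern <- paste0("\\s*(", unknown_word, ")\\s*")

# One input has no whitespace around the word, the other has runs of spaces.
demo <- c("TreeRULake", "A B   RU   D")
result <- gsub(pattern, " \n\\1 ", demo)

# cat() renders the embedded newline instead of printing the escape.
cat(result, sep = "\n---\n")
```

In both cases the word ends up with exactly one space before the newline and one space after the word: the first input becomes "Tree \nRU Lake" and the second "A B \nRU D".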

UPDATE

If you only need to add a newline before occurrences of RU that are whole words, use the \\b word boundary (RU inside TreeRULake is not a whole word, so the first element stays unchanged):

> gsub(paste0("[[:space:]]*\\b(", unknown_word[[1]], ")\\b[[:space:]]*"), " \n\\1 ", text)
[1] "TreeRULakeSunWater" "A B C \nRU D"   
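If you want both results at once (the occurrence glued between "Tree" and "Lake" and every whole-word occurrence), one way is to run the two replacements in sequence. The following is only a sketch: add_newlines is a hypothetical helper name, and it assumes the delimiters and the unknown word contain no regex metacharacters.

```r
# Hypothetical helper, not part of the original answer.
add_newlines <- function(text, left = "Tree", right = "Lake") {
  # Find the unknown word between the two fixed delimiters.
  m <- regmatches(text, regexec(paste0(left, "(\\w+)", right), text))
  hits <- Filter(length, m)
  if (length(hits) == 0) return(text)  # delimiters not found: nothing to do
  word <- hits[[1]][2]

  # Step 1: whole-word occurrences, collapsing surrounding whitespace.
  out <- gsub(paste0("[[:space:]]*\\b(", word, ")\\b[[:space:]]*"),
              " \n\\1 ", text)

  # Step 2: the occurrence glued between the delimiters.
  gsub(paste0(left, "(", word, ")", right),
       paste0(left, " \n\\1 ", right), out)
}

text <- c("TreeRULakeSunWater", "A B C RU D")
result <- add_newlines(text)
cat(result, sep = "\n---\n")
```

The whole-word step does not touch "TreeRULake" and the delimiter step does not touch the standalone "RU", so the order of the two steps does not matter for this input.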
