
How to gsub on the text between two words in R?

EDIT:

I would like to place a \n before a specific unknown word in my text. I know that the first time the unknown word appears in my text, it will be between "Tree" and "Lake".

Example of text:

text
[1]  "TreeRULakeSunWater" 
[2]  "A B C D"

EDIT:

"Tree" and "Lake" will never change, but the word between them is always changing, so I do not look for "RU" in my regex.

What I am currently doing:

if (grepl(".*Tree\\s*|Lake.*",  text)) { text <- gsub(".*Tree\\s*|Lake.*", "\n\\1", text)}

The problem with what I am doing above is that the gsub will replace all of text and leave just \nRU.

text
[1] "\nRU"

I have also tried:

if (grepl(".*Tree *(.*?) *Lake.*",  text)) { text <- gsub(".*Tree *(.*?) *Lake.*", "\n\\1", text)}

What I would like text to look like after gsub:

text
[1] "Tree \nRU LakeSunWater"
[2] "A B C D"

EDIT:

From Wiktor Stribizew's comment, I am able to do a successful gsub:

gsub("Tree(\\w+)Lake", "Tree \n\\1 Lake", text)

But this only substitutes where "RU" sits between "Tree" and "Lake", which is the first occurrence of the unknown word. The unknown word (in this case "RU") will show up many times in the text, and I would like to place \n in front of every occurrence of "RU" when "RU" is a whole word.

New example of text:

text
[1] "TreeRULakeSunWater"
[2] "A B C RU D"

New example of what I would like:

text
[1] "Tree \nRU LakeSunWater"
[2] "A B C \nRU D"

Any help will be appreciated. Please let me know if further information is needed.

You need to find the unknown word between "Tree" and "Lake" first. You can use

unknown_word <- gsub(".*Tree(\\w+)Lake.*", "\\1", text)

The pattern matches any characters up to the last Tree in a string, then captures the unknown word (\\w+ = one or more word characters) up to the Lake, and then matches the rest of the string. gsub is applied to every string in the vector, and elements that do not match the pattern are returned unchanged. You can access the first result with the [[1]] index.

Then, when you know the word, replace it with
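Because elements that do not match come back unchanged, relying on [[1]] only works when the first element is the one containing Tree...Lake. A minimal sketch of a slightly more defensive extraction follows; the names hits and the sample vector are illustrative assumptions, not part of the original answer.

```r
# Sample data taken from the question.
text <- c("TreeRULakeSunWater", "A B C RU D")

# Flag the elements that actually contain Tree<word>Lake.
hits <- grepl("Tree\\w+Lake", text)

# Extract the captured word from the first matching element only.
# sub() is sufficient because the pattern spans the whole string,
# so there is at most one match per element.
unknown_word <- sub(".*Tree(\\w+)Lake.*", "\\1", text[hits][1])

unknown_word  # "RU"
```

If no element matches, text[hits][1] is NA and so is unknown_word, which is worth checking before building the second pattern.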

gsub(paste0("[[:space:]]*(", unknown_word[[1]], ")[[:space:]]*"), " \n\\1 ", text)

See the IDEONE demo.

Here, the pattern is [[:space:]]*( + unknown_word[[1]] + )[[:space:]]*. It matches zero or more whitespace characters on both sides of the unknown word, and the unknown word itself (captured into Group 1). In the replacement, the whitespace on each side is collapsed to a single space (or a space is added if there was none), a newline is inserted before the word, and \\1 restores the unknown word. You may replace [[:space:]] with \\s.
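To make the behaviour of this pattern concrete, here is a small sketch run on the question's sample data. The third element, "TRUE", is an added assumption meant to show that without word boundaries the pattern also fires inside longer words.

```r
# First two elements come from the question; "TRUE" is an extra
# illustrative element that merely contains "RU".
text <- c("TreeRULakeSunWater", "A B C RU D", "TRUE")
unknown_word <- "RU"

# Optional whitespace, the word itself (Group 1), optional whitespace.
pattern <- paste0("[[:space:]]*(", unknown_word, ")[[:space:]]*")

# Replace each match with: space, newline, the word, space.
result <- gsub(pattern, " \n\\1 ", text)

# result[1] is "Tree \nRU LakeSunWater"
# result[2] is "A B C \nRU D"
# result[3] is "T \nRU E"  (the word was matched inside "TRUE")
```

The first two elements come out as requested in the question, while the third shows why a whole-word restriction may be needed.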

UPDATE

If you only need to add a newline before occurrences of RU that are whole words, use the \\b word boundary:

> gsub(paste0("[[:space:]]*\\b(", unknown_word[[1]], ")\\b[[:space:]]*"), " \n\\1 ", text)
[1] "TreeRULakeSunWater" "A B C \nRU D"   
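Note that with \\b the first element is left untouched, because RU inside TreeRULake is not a whole word. If both behaviours from the question are wanted, one possible approach (a sketch, not part of the original answer) is to run two passes: the whole-word pass, then a pass for the known Tree...Lake position. The variable names and the use of perl = TRUE are assumptions made here for illustration.

```r
# First two elements come from the question; "TRUE" is an extra
# illustrative element that should stay unchanged.
text <- c("TreeRULakeSunWater", "A B C RU D", "TRUE")
unknown_word <- "RU"

# Pass 1: whole-word occurrences only (\\b on both sides).
whole_word <- paste0("[[:space:]]*\\b(", unknown_word, ")\\b[[:space:]]*")
step1 <- gsub(whole_word, " \n\\1 ", text, perl = TRUE)

# Pass 2: the occurrence glued between Tree and Lake.
between <- paste0("Tree(", unknown_word, ")Lake")
result <- gsub(between, "Tree \n\\1 Lake", step1, perl = TRUE)

# result[1] is "Tree \nRU LakeSunWater"
# result[2] is "A B C \nRU D"
# result[3] is "TRUE"
```

Since the word was captured with \\w+, it contains only letters, digits, and underscores, so pasting it into a pattern without escaping should be safe in this setting.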
