Several find-and-replace operations on a text file
I have a text file that I want to convert to a data frame. The text is messy and needs cleaning: removing a couple of repetitive sentences, replacing new lines (the wildcard in Word is "^p") with a tab or comma, and so on.
For example, my text file looks like this:
-The data 1 is taken on Aug, 2009 at UBC
and is significant with p value <0.01
-The data 2 is taken on Sep, 2012 at SFU
and is not significant with p value > 0.06
How can I do multiple find-and-replace operations? I used this code:
tx = readLines("My_text.txt")
tx2 = gsub(pattern = "is taken on", replace = " ", x = tx)
tx3 = gsub(pattern = "at", replace = " ", x = tx2)
writeLines(tx3, con="tx3.txt")
But I do not know how to replace "at" with a tab (^t), or how to replace a paragraph mark (^p), or for example a space followed by a paragraph mark ( ^p), with a comma.
Use regular expressions and take word boundaries (\\b) into account.
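To see why the surrounding context matters, here is a minimal sketch using the first line of the sample text. The variable names `x`, `naive` and `bounded` are illustrative only; the underscore replacement is chosen just to make the matches visible.

```r
x <- "-The data 1 is taken on Aug, 2009 at UBC"

# A bare "at" also matches inside "data"
naive <- gsub("at", "_", x)
naive
# [1] "-The d_a 1 is taken on Aug, 2009 _ UBC"

# Requiring whitespace on both sides hits only the standalone word
bounded <- gsub("\\sat\\s", "\t", x)
bounded
# [1] "-The data 1 is taken on Aug, 2009\tUBC"
```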
To avoid multiple gsub() calls, we could use a replacement matrix rmx.
rmx <- matrix(c("\\sis taken on\\s\\b", " ",
                "\\b\\sat\\s", "\t"  # replace with tab
                ), 2)
#      [,1]                   [,2]
# [1,] "\\sis taken on\\s\\b" "\\b\\sat\\s"
# [2,] " "                    "\t"
Now we may feed gsub() with rmx column by column using apply(). To make permanent changes to tx we can use the <<- operator. To avoid spamming the console, we could wrap the whole thing in invisible().
tx <- readLines("My_text.txt")
invisible(
apply(rmx, MARGIN=2, function(x) tx <<- gsub(x[1], x[2], tx))
)
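If you would rather not modify tx from inside a function, one possible alternative (not part of the original answer) is to thread the vector through the rules with Reduce(). This is only a sketch: the two sample lines stand in for readLines("My_text.txt"), the names `rules` and `cleaned` are made up for illustration, and `perl = TRUE` is my own choice to get PCRE semantics for \\b and \\s.

```r
# Sample lines standing in for readLines("My_text.txt")
tx <- c("-The data 1 is taken on Aug, 2009 at UBC",
        "and is significant with p value <0.01")

# Same patterns as the columns of rmx, stored as (pattern, replacement) pairs
rules <- list(c("\\sis taken on\\s\\b", " "),
              c("\\b\\sat\\s", "\t"))

# Reduce() applies each rule to the result of the previous one,
# starting from tx, and returns the final character vector
cleaned <- Reduce(function(txt, rule) gsub(rule[1], rule[2], txt, perl = TRUE),
                  rules, tx)
cleaned
```

The result is returned as a value, so nothing is printed unless you ask for it and invisible() is not needed.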
To get continuous text rather than paragraphs (which is what I assume you mean by the ^p replacement), we could simply paste() the result, collapsing it with ", ". The empty strings should be filtered out with tx != "".
tx <- paste(tx[tx != ""], collapse=", ")
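For the "space followed by ^p" case from the question, a small sketch under the assumption that it means trailing whitespace at the end of a line: strip it first, then drop empty lines and collapse. The sample vector `lines` and the name `one_line` are illustrative, not from the original post.

```r
# Illustrative input: one line with a trailing space, one empty line
lines <- c("-The data 1 Aug, 2009 UBC ", "", "and is significant with p value <0.01")

# Remove trailing whitespace so " ^p" and "^p" end up being treated alike
lines <- sub("\\s+$", "", lines)

# nzchar() is TRUE for non-empty strings, equivalent to lines != "" here
one_line <- paste(lines[nzchar(lines)], collapse = ", ")
one_line
# [1] "-The data 1 Aug, 2009 UBC, and is significant with p value <0.01"
```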
Now write the result to a file with writeLines().
writeLines(tx, con="tx4.txt")
Result
-The data 1 Aug, 2009 UBC, and is significant with p value <0.01, -The data 2 Sep, 2012 SFU, and is not significant with p value > 0.06
Appendix
We can also replace special characters in R by double-escaping them (read this post).
gsub("\\$", "\t", "today$is$monday")
# [1] "today\tis\tmonday"
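As a complement, gsub() also accepts fixed = TRUE, which treats the pattern as a literal string so that no escaping is needed. A brief sketch comparing the two; the variable names are illustrative.

```r
# Double-escaped regex and fixed = TRUE give the same result for a literal "$"
escaped <- gsub("\\$", "\t", "today$is$monday")
literal <- gsub("$", "\t", "today$is$monday", fixed = TRUE)
identical(escaped, literal)
# [1] TRUE

# Unescaped, "." is a regex wildcard that matches every character
as_regex <- gsub(".", "-", "a.b.c")                # "-----"
as_fixed <- gsub(".", "-", "a.b.c", fixed = TRUE)  # "a-b-c"
```

Note that fixed = TRUE rules out \\b and \\s, so it only suits replacements that need no regex features.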
Using the regex supplied by jay.sf, you could use str_replace_all from the stringr package to do it with a named vector.
library(stringr)
new_tx <- str_replace_all(tx,
                          c("\\sis taken on\\s" = " ",
                            "\\b\\sat\\s" = "\t",
                            "\\b\\sp\\b" = ","))
cat(new_tx, sep = "\n")  # print one element per line
Result
-The data 1 Aug, 2009 UBC
and is significant with, value <0.01
-The data 2 Sep, 2012 SFU
and is not significant with, value > 0.06