Multiple find-and-replace operations on a text file

Question

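I have a text file that I want to convert into a data frame. The text is messy and needs cleaning: removing several repeated sentences and replacing newlines (the Word wildcard is "^p") with tabs or commas...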

For example, my text file looks like this:

-The data 1 is taken on Aug, 2009 at UBC
and is significant with p value <0.01

-The data 2 is taken on Sep, 2012 at SFU
and is  not significant with p value > 0.06

How can I perform multiple find-and-replace operations? I used this code:

tx = readLines("My_text.txt")
tx2 = gsub(pattern = "is taken on", replacement = " ", x = tx)
tx3 = gsub(pattern = "at", replacement = " ", x = tx2)
writeLines(tx3, con="tx3.txt")

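But I don't know how to replace "at" with a tab (^t), or how to replace a paragraph mark (^p) with ",", or, for example, how to replace a space followed by ^p ( ^p) with ",".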

Answer 1

Use regular expressions that take word boundaries (\\b) into account.
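To see why the boundary matters, here is a minimal sketch that uses one sample line from the question as an in-memory string (the variable names are illustrative only). A bare "at" pattern also matches inside "data", while a pattern that requires surrounding whitespace and a word boundary only touches the standalone word.

```r
# One sample line from the question, held in memory for illustration
line <- "-The data 2 is taken on Sep, 2012 at SFU"

# A bare "at" also matches the "at" inside "data"
naive <- gsub("at", "\t", line)

# Requiring surrounding whitespace (and a word boundary before it)
# restricts the match to the standalone word "at"
bounded <- gsub("\\b\\sat\\s", "\t", line)

print(naive)
print(bounded)
```

In the first result the word "data" is broken by a tab; in the second it is left intact.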

To avoid multiple gsub() calls, we can use a replacement matrix rmx.

rmx <- matrix(c("\\sis taken on\\s\\b", " ",
                "\\b\\sat\\s", "\t"          # replace with tab
                ), 2)
#      [,1]                   [,2]
# [1,] "\\sis taken on\\s\\b" "\\b\\sat\\s"
# [2,] " "                    "\t"

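Now we can use apply() to feed rmx to gsub() column by column. To make the changes to tx persistent, we can use the <<- operator. To avoid spamming the console, we can wrap the whole thing in invisible().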

tx <- readLines("My_text.txt")
invisible(
  apply(rmx, MARGIN=2, function(x) tx <<- gsub(x[1], x[2], tx))
  )
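As an alternative sketch, not part of the original answer, the same column-wise replacement can be written as a plain for loop, which avoids the global assignment with <<-. The sample lines are held in an in-memory vector here (an assumption made so the snippet runs without My_text.txt).

```r
# Sample lines from the question, standing in for readLines("My_text.txt")
tx <- c("-The data 1 is taken on Aug, 2009 at UBC",
        "and is significant with p value <0.01",
        "",
        "-The data 2 is taken on Sep, 2012 at SFU",
        "and is  not significant with p value > 0.06")

# Same layout as rmx in the answer: row 1 = pattern, row 2 = replacement
rmx <- matrix(c("\\sis taken on\\s\\b", " ",
                "\\b\\sat\\s", "\t"), 2)

# Apply each pattern/replacement pair in turn
for (j in seq_len(ncol(rmx))) {
  tx <- gsub(rmx[1, j], rmx[2, j], tx)
}

print(tx)
```

Because the loop runs in the calling environment, ordinary assignment with <- is enough and nothing is printed as a side effect.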

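To obtain continuous text instead of paragraphs (which I assume is what you mean by the ^p replacement), we can simply paste() the result with collapse = ", ". Empty strings should be filtered out with tx != "".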

tx <- paste(tx[tx != ""], collapse=", ")
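A small illustration of this step, using a hypothetical in-memory vector in place of the file contents: the logical index drops the empty element, and collapse joins the remaining elements into a single string.

```r
# Hypothetical lines, including the empty string that readLines()
# returns for a blank line
lines <- c("first paragraph", "", "second paragraph")

# Keep the non-empty elements, then join them with ", "
joined <- paste(lines[lines != ""], collapse = ", ")

print(joined)
print(length(joined))
```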

Now write the result with writeLines().

writeLines(tx, con="tx4.txt")

Result

-The data 1 Aug, 2009    UBC, and is significant with p value <0.01, -The data 2 Sep, 2012    SFU, and is  not significant with p value > 0.06

Appendix

We can also replace special characters in R by double-escaping them; see this post.

gsub("\\$", "\t", "today$is$monday")
# [1] "today\tis\tmonday"
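As a hedged alternative to escaping, gsub() also accepts fixed = TRUE, which treats the pattern as a literal string rather than a regular expression. A minimal sketch comparing the two approaches:

```r
# Escaped regular expression, as in the appendix
escaped <- gsub("\\$", "\t", "today$is$monday")

# With fixed = TRUE the "$" is matched literally, so no escaping is needed
literal <- gsub("$", "\t", "today$is$monday", fixed = TRUE)

print(identical(escaped, literal))
```

Note that fixed = TRUE disables all regular-expression features, so it does not combine with patterns such as \\b or \\s.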

Answer 2

Using the regular expressions provided by jay.sf, you can use str_replace_all from the stringr package with a named vector.

library(stringr)

new_tx <- str_replace_all(tx,
                          c("\\sis taken on\\s" = " ",
                            "\\b\\sat\\s" = "\t",
                            "\\b\\sp\\b" = ","))

cat(new_tx, sep = "\n")  # print one element per line

Result

-The data 1 Aug, 2009    UBC
and is significant with, value <0.01

-The data 2 Sep, 2012    SFU
and is  not significant with, value > 0.06

Solution 1
2 2019-10-06 06:57:22

Solution 2
1 Accepted 2019-10-06 09:29:01
