Lookaround regex pattern in R

Question

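I'm stuck on creating the right regex pattern to split the contents of a data frame column without losing any of its elements. I have to use the separate() function from the tidyr package because this is part of a longer processing pipeline. Since I don't want to lose any element of the string, I'm working on a lookahead/lookbehind expression.

The strings to split can follow one of these patterns:

letters only (e.g. 'abcd')
letters-dash-digits (e.g. 'abcd-123')
letters followed by digits (e.g. 'abcd1234')

The column contents should be split into at most 3 columns, one per group.

I want to split every time the kind of element changes, that is, after the letters and after the dash. There can be one or more letters and one or more digits, but only ever one dash. Strings that contain only letters don't need to be split.

Here is what I tried: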

library(tidyr) 
myDat = data.frame(drugName = c("ab-1234", 'ab-1234', 'ab-1234',
                                'placebo', 'anotherdrug', 'andanother',
                                'xyz123', 'xyz123', 'placebo', 'another',
                                'omega-3', 'omega-3', 'another', 'placebo'))
drugColNames = paste0("X", 1:3) 

# This pattern doesn't split strings that only consist of number and letters, e.g. "xyz123" is not split after the letters.
pat = '(?=-[0-9+])|(?<=[a-z+]-)'

# This pattern splits at all the right places, but the last group (the numbers), is separated and not kept together.
# pat = '(?=-[0-9+]|[0-9+])|(?<=[a-z+]-)'

splitDat = separate(myDat, drugName,
         into = drugColNames,
         sep = pat)

The split output should be:

"ab-1234" --> "ab" "-" "1234"
"xyz123" --> "xyz" "123"
"omega-3" --> "omega" "-" "3"

Many thanks for any help with this. :)

Answer 1

It is easier to use extract here, because there is no fixed delimiter, and it also avoids the need for regex lookarounds.

tidyr::extract(myDat, drugName, drugColNames, '([a-z]+)(-)?(\\d+)?', remove = FALSE)

#      drugName          X1 X2   X3
#1      ab-1234          ab  - 1234
#2      ab-1234          ab  - 1234
#3      ab-1234          ab  - 1234
#4      placebo     placebo        
#5  anotherdrug anotherdrug        
#6   andanother  andanother        
#7       xyz123         xyz     123
#8       xyz123         xyz     123
#9      placebo     placebo        
#10     another     another        
#11     omega-3       omega  -    3
#12     omega-3       omega  -    3
#13     another     another        
#14     placebo     placebo
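
As a quick check of what each capture group picks up, the same pattern can be run through base R's regexec() and regmatches(), without tidyr. This is only a sketch: the helper name split_groups is made up for illustration, and it is not meant to describe how extract() works internally.

```r
# Hypothetical helper (not part of tidyr): return the three capture groups
# of the pattern from this answer for a single string.
split_groups <- function(x, pattern = "([a-z]+)(-)?(\\d+)?") {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
  m[-1]  # drop the full match, keep only the groups
}

split_groups("ab-1234")  # "ab" "-" "1234"
split_groups("xyz123")   # "xyz" "" "123"
split_groups("placebo")  # "placebo" "" ""
```

Optional groups that take no part in the match come back as empty strings here, which lines up with the blank cells in the output above.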

Answer 2

You can use

> extract(myDat, "drugName",drugColNames, "^([[:alpha:]]+)(\\W*)(\\d*)$", remove=FALSE)
      drugName          X1 X2   X3
1      ab-1234          ab  - 1234
2      ab-1234          ab  - 1234
3      ab-1234          ab  - 1234
4      placebo     placebo        
5  anotherdrug anotherdrug        
6   andanother  andanother        
7       xyz123         xyz     123
8       xyz123         xyz     123
9      placebo     placebo        
10     another     another        
11     omega-3       omega  -    3
12     omega-3       omega  -    3
13     another     another        
14     placebo     placebo        
>

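The regex used to extract the data is

^([[:alpha:]]+)(\W*)(\d*)$

See the regex demo.

Details

^ - start of string
([[:alpha:]]+) - Group 1 (column X1): one or more letters
(\\W*) - Group 2 (column X2): zero or more non-word characters
(\\d*) - Group 3 (column X3): zero or more digits
$ - end of string.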
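
The breakdown above can be tried out directly in base R. The sketch below is an illustration only: groups_of is a made-up helper name, and it uses regexec() and regmatches() instead of extract() so that it runs without tidyr.

```r
# The pattern from this answer, written as an R string literal.
pat <- "^([[:alpha:]]+)(\\W*)(\\d*)$"

# Hypothetical helper (illustration only): capture groups for one string,
# or character(0) when the whole string does not match.
groups_of <- function(x) {
  m <- regmatches(x, regexec(pat, x, perl = TRUE))[[1]]
  m[-1]
}

groups_of("omega-3")  # "omega" "-" "3"
groups_of("placebo")  # "placebo" "" ""  (groups 2 and 3 match the empty string)
groups_of("ab-12x")   # character(0): the trailing letter breaks the anchored match
```

Because of the ^ and $ anchors, a string that does not fit the letters, non-word characters, digits layout is rejected as a whole rather than matched in part.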

To drop the original column, omit the remove=FALSE argument.
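
For readers who still want a lookaround pattern, as the question originally asked: inside a character class such as [0-9+] the + is a literal plus sign, not a quantifier, and a lone lookahead like (?=[0-9]) is true before every digit, which would explain why the digits ended up split apart. One alternative is to require a change of character type on both sides of the split point. The sketch below assumes the data contains only lowercase letters, at most one dash, and digits, and that "|" never occurs in it. It uses base R gsub() and strsplit() rather than separate(), whose handling of zero-width matches is not checked here. The names boundary and split_on_change are made up for illustration.

```r
# Zero-width boundary: a letter on the left with a dash or digit on the right,
# or a dash on the left with a digit on the right.
boundary <- "(?<=[a-z])(?=[-0-9])|(?<=-)(?=[0-9])"

# Hypothetical helper: mark each boundary with "|" (assumed absent from the
# data), then split on that literal marker.
split_on_change <- function(x) {
  marked <- gsub(boundary, "|", x, perl = TRUE)
  strsplit(marked, "|", fixed = TRUE)[[1]]
}

split_on_change("ab-1234")  # "ab" "-" "1234"
split_on_change("xyz123")   # "xyz" "123"
split_on_change("placebo")  # "placebo"
```

Unlike the extract() approach, this returns a variable number of pieces per string, so padding to three columns would still have to be handled separately.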

Answer 1
Score 3, accepted, 2020-11-17 09:56:26

Answer 2
Score 2, 2020-11-17 09:57:24
