String split in R based on certain criteria

I am working with strings like the following:

ID      Col1
------------------------------------------------------------------------------------
11         GLIPIZIDE  10 MG TAB 1 TABLET PO QAM
23         GLIPIZIDE  5 MG TAB 2 TABLETS PO BID
32         GLIPIZIDE  TAB PO
12         GLIPIZIDE  TAB PO PRN
343        PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3
31         METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
44         METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #400 TABLETS REFILL X3
34         METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
38         METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3

I want to accomplish two things.

1) Store the first word in a new column (Col2)
2) Search for the term "mg" and capture the string before the word "mg"
   and store that in a new column (Col3)

Continuing with this example, the final output should look like this:

Id     Col2                  Col3  
---------------------------------
11     GLIPIZIDE             10 MG
23     GLIPIZIDE             5 MG
32     GLIPIZIDE             
12     GLIPIZIDE
343    PIOGLITAZONE          45 MG 
31     METFORMIN             500 MG
44     METFORMIN             500 MG
34     METFORMIN             500 MG
38     METFORMIN             500 MG

Any help with this problem is much appreciated.

Data

dd <- read.table(header = TRUE, stringsAsFactors = FALSE, text="ID      Col1
  11         'GLIPIZIDE  10 MG TAB 1 TABLET PO QAM'
23         'GLIPIZIDE  5 MG TAB 2 TABLETS PO BID'
32         'GLIPIZIDE  TAB PO'
12         'GLIPIZIDE  TAB PO PRN'
343        'PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3'
31        'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
44        'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #400 TABLETS REFILL X3'
34        'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
38        'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'")

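One option is to use two regular expressions: 1) capture the first word at the start of the string ( ^\\w+ ) and 2) find digits followed by "mg" ( \\d+ mg ), matched case-insensitively.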

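# Each pattern either captures the wanted piece (group 1) or matches any other
# single character; replacing every match with '\\1' keeps only the captures.
within(dd, {
  col1 <- gsub('(^\\w+)|.', '\\1', Col1)        # first word of the string
  dose <- gsub('(?i)(\\d+ mg)|.', '\\1', Col1)  # digits followed by "mg", any case
})[, c('col1','dose')]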

#           col1   dose
# 1    GLIPIZIDE  10 MG
# 2    GLIPIZIDE   5 MG
# 3    GLIPIZIDE       
# 4    GLIPIZIDE       
# 5 PIOGLITAZONE  45 MG
# 6    METFORMIN 500 MG
# 7    METFORMIN 500 MG
# 8    METFORMIN 500 MG
# 9    METFORMIN 500 MG
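
The same two patterns can also be applied with regexpr() and regmatches() from base R, which extract the match directly instead of deleting everything around it. This is only a sketch of an alternative, not part of the answer above: the helper name extract_first and the vector x are made up for illustration, and rows without a match are filled with an empty string to mirror the output shown above.

```r
# Hypothetical helper: return the first match of `pattern` in each element
# of `x`, or "" where there is no match.
extract_first <- function(pattern, x, ...) {
  m <- regexpr(pattern, x, ...)      # match position per element, -1 if none
  out <- rep("", length(x))
  out[m != -1] <- regmatches(x, m)   # regmatches() returns matched elements only
  out
}

# Illustrative input taken from the question's Col1 values
x <- c("GLIPIZIDE  10 MG TAB 1 TABLET PO QAM",
       "GLIPIZIDE  TAB PO",
       "METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3")

col2 <- extract_first("^\\w+", x)                        # first word
col3 <- extract_first("\\d+ mg", x, ignore.case = TRUE)  # dose such as "10 MG"

data.frame(Col2 = col2, Col3 = col3, stringsAsFactors = FALSE)
```

With the data frame from the question, the same calls would be extract_first("^\\w+", dd$Col1) and extract_first("\\d+ mg", dd$Col1, ignore.case = TRUE). Swapping "" for NA_character_ in the helper is a reasonable choice if missing doses should be treated as missing values.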

Here is an approach using stringi:

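library(stringi)
# Extract all matches of either pattern; simplify = TRUE returns a character
# matrix with one row per string, padded with "" where there is no dose.
ss <- stri_extract_all_regex(dd$Col1, "(?i)(^\\w+)|(\\d+ mg)", simplify = TRUE)
setNames(cbind(dd[1], ss), c("ID", "Col2", "Col3"))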
#    ID         Col2   Col3
# 1  11    GLIPIZIDE  10 MG
# 2  23    GLIPIZIDE   5 MG
# 3  32    GLIPIZIDE       
# 4  12    GLIPIZIDE       
# 5 343 PIOGLITAZONE  45 MG
# 6  31    METFORMIN 500 MG
# 7  44    METFORMIN 500 MG
# 8  34    METFORMIN 500 MG
# 9  38    METFORMIN 500 MG
