简体   繁体   中英

String split in R based on certain criteria

I am dealing with strings that are as follows

ID      Col1
------------------------------------------------------------------------------------
11         GLIPIZIDE  10 MG TAB 1 TABLET PO QAM
23         GLIPIZIDE  5 MG TAB 2 TABLETS PO BID
32         GLIPIZIDE  TAB PO
12         GLIPIZIDE  TAB PO PRN
343        PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3
31        METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
44        METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #400 TABLETS REFILL X3
34        METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
38        METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3

What I want to accomplish is two things.

1) Store the first word a new column (Col2)
2) Search for the term "mg" and capture the string before the word "mg"
   and store that in a new column (Col3)

Going with the example, the final output should like this

Id     Col2                  Col3  
---------------------------------
11     GLIPIZIDE             10 MG
23     GLIPIZIDE             5 MG
32     GLIPIZIDE             
12     GLIPIZIDE
343    PIOGLITAZONE          45 MG 
31     METFORMIN             500 MG
44     METFORMIN             500 MG
34     METFORMIN             500 MG
38     METFORMIN             500 MG

Any help on this issue is much appriciated.

Data

dd <- read.table(header = TRUE, stringsAsFactors = FALSE, text="ID      Col1
  11         'GLIPIZIDE  10 MG TAB 1 TABLET PO QAM'
23         'GLIPIZIDE  5 MG TAB 2 TABLETS PO BID'
32         'GLIPIZIDE  TAB PO'
12         'GLIPIZIDE  TAB PO PRN'
343        'PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3'
31        'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
44        'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #400 TABLETS REFILL X3'
34        'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
38        'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'")

One was is to use two regexes to 1) capture the first word at the beginning of the string ( ^\\\\w+ ) and 2) find digits followed by "mg" ( \\\\d+ mg )

dd <- read.table(header = TRUE, stringsAsFactors = FALSE, text="ID      Col1
  11         'GLIPIZIDE  10 MG TAB 1 TABLET PO QAM'
23         'GLIPIZIDE  5 MG TAB 2 TABLETS PO BID'
32         'GLIPIZIDE  TAB PO'
12         'GLIPIZIDE  TAB PO PRN'
343        'PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3'
31        'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
44        'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #400 TABLETS REFILL X3'
34        'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
38        'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'")



within(dd, {
  col1 <- gsub('(^\\w+)|.', '\\1', Col1)
  dose <- gsub('(?i)(\\d+ mg)|.', '\\1', Col1)
})[, c('col1','dose')]

#           col1   dose
# 1    GLIPIZIDE  10 MG
# 2    GLIPIZIDE   5 MG
# 3    GLIPIZIDE       
# 4    GLIPIZIDE       
# 5 PIOGLITAZONE  45 MG
# 6    METFORMIN 500 MG
# 7    METFORMIN 500 MG
# 8    METFORMIN 500 MG
# 9    METFORMIN 500 MG

Here's a go with stringi .

library(stringi)
ss <- stri_extract_all_regex(dd$Col1, "(?i)(^\\w+)|(\\d+ mg)", simplify = TRUE)
setNames(cbind(dd[1], ss), c("ID", "Col2", "Col3")))
#    ID         Col2   Col3
# 1  11    GLIPIZIDE  10 MG
# 2  23    GLIPIZIDE   5 MG
# 3  32    GLIPIZIDE       
# 4  12    GLIPIZIDE       
# 5 343 PIOGLITAZONE  45 MG
# 6  31    METFORMIN 500 MG
# 7  44    METFORMIN 500 MG
# 8  34    METFORMIN 500 MG
# 9  38    METFORMIN 500 MG

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM