I am dealing with strings like the following:
ID Col1
------------------------------------------------------------------------------------
11 GLIPIZIDE 10 MG TAB 1 TABLET PO QAM
23 GLIPIZIDE 5 MG TAB 2 TABLETS PO BID
32 GLIPIZIDE TAB PO
12 GLIPIZIDE TAB PO PRN
343 PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3
31 METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
44 METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #400 TABLETS REFILL X3
34 METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
38 METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
What I want to accomplish is two things:
1) Store the first word in a new column (Col2)
2) Search for the term "mg" and capture the number that precedes it (e.g. "10 MG"),
and store that in a new column (Col3)
Going with the example, the final output should look like this:
ID   Col2          Col3
---------------------------------
11   GLIPIZIDE     10 MG
23   GLIPIZIDE     5 MG
32   GLIPIZIDE
12   GLIPIZIDE
343  PIOGLITAZONE  45 MG
31   METFORMIN     500 MG
44   METFORMIN     500 MG
34   METFORMIN     500 MG
38   METFORMIN     500 MG
Any help on this issue is much appreciated.
Data
dd <- read.table(header = TRUE, stringsAsFactors = FALSE, text="ID Col1
11 'GLIPIZIDE 10 MG TAB 1 TABLET PO QAM'
23 'GLIPIZIDE 5 MG TAB 2 TABLETS PO BID'
32 'GLIPIZIDE TAB PO'
12 'GLIPIZIDE TAB PO PRN'
343 'PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3'
31 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
44 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #400 TABLETS REFILL X3'
34 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
38 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'")
One way is to use two regexes: 1) capture the first word at the beginning of the string ( ^\\w+ ) and 2) find digits followed by "mg" ( \\d+ mg ), matched case-insensitively.
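within(dd, {
  # keep only the leading word; every other character matches '.' and is dropped
  col1 <- gsub('(^\\w+)|.', '\\1', Col1)
  # keep only "<digits> mg" (case-insensitive via (?i)); drop everything else
  dose <- gsub('(?i)(\\d+ mg)|.', '\\1', Col1)
})[, c('col1','dose')]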
# col1 dose
# 1 GLIPIZIDE 10 MG
# 2 GLIPIZIDE 5 MG
# 3 GLIPIZIDE
# 4 GLIPIZIDE
# 5 PIOGLITAZONE 45 MG
# 6 METFORMIN 500 MG
# 7 METFORMIN 500 MG
# 8 METFORMIN 500 MG
# 9 METFORMIN 500 MG
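To see why the `(pattern)|.` trick works, it may help to run it on single strings. The sketch below uses the same two patterns on values taken from the question's data. Passing `perl = TRUE` is my own assumption here (the answer above uses the default engine); I added it only so the alternation is tried strictly left to right. The variable names `x`, `first_word`, `dose` and `no_dose` are made up for illustration.

```r
# Sample strings taken from the question's data
x <- "PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY"

# Every match is either the captured group (kept through \\1) or a single
# character matched by '.', for which \\1 is empty, so that character is dropped.
# perl = TRUE is an assumption of this sketch, not part of the original answer.
first_word <- gsub("(^\\w+)|.", "\\1", x, perl = TRUE)
dose       <- gsub("(?i)(\\d+ mg)|.", "\\1", x, perl = TRUE)

# A string without any "<digits> mg" collapses to an empty string
no_dose <- gsub("(?i)(\\d+ mg)|.", "\\1", "GLIPIZIDE TAB PO", perl = TRUE)

print(first_word)  # "PIOGLITAZONE"
print(dose)        # "45 MG"
print(no_dose)     # ""
```

One thing to keep in mind with this approach: if a string contained more than one "<digits> mg" occurrence, all of them would be kept and pasted together without a separator, because `gsub` replaces every match.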
Here's a go with stringi.
library(stringi)
ss <- stri_extract_all_regex(dd$Col1, "(?i)(^\\w+)|(\\d+ mg)", simplify = TRUE)
setNames(cbind(dd[1], ss), c("ID", "Col2", "Col3"))
# ID Col2 Col3
# 1 11 GLIPIZIDE 10 MG
# 2 23 GLIPIZIDE 5 MG
# 3 32 GLIPIZIDE
# 4 12 GLIPIZIDE
# 5 343 PIOGLITAZONE 45 MG
# 6 31 METFORMIN 500 MG
# 7 44 METFORMIN 500 MG
# 8 34 METFORMIN 500 MG
# 9 38 METFORMIN 500 MG
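A possible variation, in case you would rather get a real missing value than an empty string when there is no dose: extract each column with its own call to `stri_extract_first_regex`, which returns `NA` when nothing matches. This is only a sketch on two of the sample strings; the names `x`, `col2`, `col3` and `res` are hypothetical, and it assumes the stringi package is installed.

```r
library(stringi)

# Two sample strings from the question: one with a dose, one without
x <- c("GLIPIZIDE 10 MG TAB 1 TABLET PO QAM",
       "GLIPIZIDE TAB PO")

# First word at the start of each string
col2 <- stri_extract_first_regex(x, "^\\w+")

# First "<digits> mg", case-insensitive; NA where no dose is present
col3 <- stri_extract_first_regex(x, "(?i)\\d+ mg")

res <- data.frame(Col2 = col2, Col3 = col3, stringsAsFactors = FALSE)
print(res)
```

Keeping the two patterns separate also means the result does not depend on the order in which matches appear in the string, at the cost of scanning each string twice.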