
R: Extracting a substring consisting of a number followed by a pattern (or a space and a pattern) without extracting other numbers

I am extracting the number of tablets (or capsules) in a container from a dataset. The amount and the form are part of a string in the description column, which also contains a lot of other information. I would like to extract the amount together with the word that specifies the form (TABLET, TAB, CAPSULE, etc.). This is what I have tried so far.

library(stringr)

testdescript = c("CARBAMAZEPINE EXTENDED RELEASE TABLETS USP400 MG DRG LIC NO TLCT17HZ2019201757131 DT3042020 100 TABS", "100MGCARBAMAZEPINECARBATOL 100 TABLET CARBAM", "TEGRITAL CR400 x10TAB CARBAMAZEPINE10", "TEGRITAL200 CARBAMAZEPINE200 100 TAB", "CARBAMAZEPINE300 MG X120 CAPSULES FOR RESEARCH PURPOSE ONLY NCV")
pattern = c("([0-9/]+[[:space:]])+TABS", " [0-9/]+TABS", "([0-9/]+[[:space:]])+TABLET", "[0-9/]+TABLET", "[0-9/]+[[:space:]]+TAB", "[0-9/]+TAB", "([0-9/]+[[:space:]])+CAPSULES", "[0-9/]+CAPSULES")
# Combine all alternatives into a single regex and extract the first match per string
str_extract(testdescript, paste0(pattern, collapse = '|'))
## which gives
[1] "3042020 100 TABS" "100 TABLET"       "10TAB"            "100 TAB"          "120 CAPSULES"

The last four results are as desired: only the number of tablets and the word specifying the form are extracted. The first result, however, contains two numbers, and the first of them (3042020) is not wanted; the desired outcome is 100 TABS. I also tried the following pattern, which gave a similar result but with an additional mistake in the fourth element (200 100 TAB).

pattern2 =c("([0-9/]|([0-9/]+[[:space:]]))+TABS", "([0-9/]|([0-9/]+[[:space:]]))+TABLET","([0-9/]|([0-9/]+[[:space:]]))+TAB", "([0-9/]|([0-9/]+[[:space:]]))+CAPSULES")
str_extract(testdescript,paste0(pattern2, collapse = '|'))
[1] "3042020 100 TABS" "100 TABLET"       "10TAB"            "200 100 TAB"      "120 CAPSULES"   

My question is: how do I extract the number and the form word, which may be separated by a space, without also picking up other, unwanted numbers?

Thanks in advance!

I think the pattern you are looking for is the following:

str_extract(string = testdescript, pattern = "[0-9]+ ?(TABLETS?|TABS?|CAPSULES?)")

To explain the pattern: it looks for a run of digits, optionally followed by a space (hence the ? after the space), and then one of the words TAB, TABLET or CAPSULE, each with an optional trailing S. TABLETS? is listed before TABS? because the regex engine tries the alternatives from left to right, so TABS? placed first would match only the TAB part of TABLET.
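If the amount and the form are needed as separate values, the same idea can be sketched with capture groups and str_match. This is only an illustration on the sample strings from the question; the names parts and result, and the choice to convert the amount to an integer, are assumptions rather than part of the original answer.

```r
library(stringr)

testdescript <- c(
  "CARBAMAZEPINE EXTENDED RELEASE TABLETS USP400 MG DRG LIC NO TLCT17HZ2019201757131 DT3042020 100 TABS",
  "100MGCARBAMAZEPINECARBATOL 100 TABLET CARBAM",
  "TEGRITAL CR400 x10TAB CARBAMAZEPINE10",
  "TEGRITAL200 CARBAMAZEPINE200 100 TAB",
  "CARBAMAZEPINE300 MG X120 CAPSULES FOR RESEARCH PURPOSE ONLY NCV"
)

# Group 1 captures the digits, group 2 the form word.
# Longer alternatives come first so TABLET is not cut short to TAB.
parts <- str_match(testdescript, "([0-9]+) ?(TABLETS?|TABS?|CAPSULES?)")

# Column 1 of the matrix is the full match; columns 2 and 3 are the groups.
result <- data.frame(
  amount = as.integer(parts[, 2]),
  form   = parts[, 3],
  stringsAsFactors = FALSE
)
print(result)
```

Strings without any match would produce NA in both columns, which makes them easy to find afterwards.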

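Is this too simple for your dataset?

str_extract(testdescript, "(?<=[[:space:]]|x|X)[0-9]+[[:space:]]?(TABLET|TABS|TAB|CAPSULES)")

The lookbehind only accepts a number that is preceded by whitespace, x or X. The longer alternatives are listed before TAB so that TABS and TABLET are not truncated to TAB.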
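A small sketch of what the lookbehind changes, using a made-up description string (it is not from the question's data) in which digits belonging to a product name sit directly in front of a form word:

```r
library(stringr)

# Hypothetical input: "12" belongs to the product name, "30" is the real count.
x <- "VITB12TABLET 30 TABS"

forms <- "(TABLET|TABS|TAB|CAPSULES)"

# Without the lookbehind, the first digits followed by a form word win.
without_lookbehind <- str_extract(x, paste0("[0-9]+[[:space:]]?", forms))

# With the lookbehind, the digits must follow whitespace, x or X.
with_lookbehind <- str_extract(
  x, paste0("(?<=[[:space:]]|x|X)[0-9]+[[:space:]]?", forms)
)

print(without_lookbehind)  # "12TABLET"
print(with_lookbehind)     # "30 TABS"
```

One caveat: a count at the very start of a string has no preceding character, so the lookbehind does not match there.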
