R: Extracting a substring consisting of a number followed by a pattern (or a space and a pattern) without extracting other numbers

I am extracting the number of tablets (or capsules) in a container from a dataset. The amount and form are in a string in the description column, which also contains a lot of other information. I would like to extract the amount and the word that specifies the form (TABLET, TAB, CAPSULE, etc.). This is what I have tried so far:

library(stringr)

testdescript = c("CARBAMAZEPINE EXTENDED RELEASE TABLETS USP400 MG DRG LIC NO TLCT17HZ2019201757131 DT3042020 100 TABS", "100MGCARBAMAZEPINECARBATOL 100 TABLET CARBAM", "TEGRITAL CR400 x10TAB CARBAMAZEPINE10", "TEGRITAL200 CARBAMAZEPINE200 100 TAB", "CARBAMAZEPINE300 MG X120 CAPSULES FOR RESEARCH PURPOSE ONLY NCV")
pattern = c("([0-9/]+[[:space:]])+TABS", " [0-9/]+TABS", "([0-9/]+[[:space:]])+TABLET", "[0-9/]+TABLET", "[0-9/]+[[:space:]]+TAB", "[0-9/]+TAB", "([0-9/]+[[:space:]])+CAPSULES", "[0-9/]+CAPSULES")
str_extract(testdescript, paste0(pattern, collapse = '|'))
## which gives
[1] "3042020 100 TABS" "100 TABLET"       "10TAB"            "100 TAB"          "120 CAPSULES"

The last four results are the desired outcome: only the number of tablets and the word specifying the form are extracted. The first result contains two numbers, of which the first (3042020) is not wanted; the desired outcome is 100 TABS. I have also tried the following pattern, which gave a similar outcome but with an additional mistake in the fourth result (200 100 TAB).

pattern2 = c("([0-9/]|([0-9/]+[[:space:]]))+TABS", "([0-9/]|([0-9/]+[[:space:]]))+TABLET", "([0-9/]|([0-9/]+[[:space:]]))+TAB", "([0-9/]|([0-9/]+[[:space:]]))+CAPSULES")
str_extract(testdescript, paste0(pattern2, collapse = '|'))
[1] "3042020 100 TABS" "100 TABLET"       "10TAB"            "200 100 TAB"      "120 CAPSULES"

My question is: how do I get the number and the form text, which may have a space between them, without also picking up other, undesired numbers?

Thanks in advance!

I think the pattern you are looking for is the following:

str_extract(string = testdescript, pattern = "[0-9]+ ?(TABLETS?|TABS?|CAPSULES?)")

To explain the pattern: it looks for a run of digits that may or may not be followed by a single space, which is why there is a ? after the space. After that it looks for the words TABLET, TAB or CAPSULE, each with an optional S. TABLETS? is listed before TABS? because alternatives are tried from left to right; with TABS? first, "100 TABLET" would come back as "100 TAB".
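Since the question asks for both the amount and the form word, here is a minimal sketch that splits the two with capture groups. It assumes stringr is installed; the names `m` and `result` are made up for illustration and are not part of the original answer.

```r
library(stringr)

testdescript <- c(
  "CARBAMAZEPINE EXTENDED RELEASE TABLETS USP400 MG DRG LIC NO TLCT17HZ2019201757131 DT3042020 100 TABS",
  "100MGCARBAMAZEPINECARBATOL 100 TABLET CARBAM",
  "TEGRITAL CR400 x10TAB CARBAMAZEPINE10",
  "TEGRITAL200 CARBAMAZEPINE200 100 TAB",
  "CARBAMAZEPINE300 MG X120 CAPSULES FOR RESEARCH PURPOSE ONLY NCV"
)

# Group 1 captures the count, group 2 the dosage-form word.
# TABLETS? comes before TABS? so that "TABLET" is not cut short to "TAB".
m <- str_match(testdescript, "([0-9]+) ?(TABLETS?|TABS?|CAPSULES?)")

# Column 1 of m is the full match; columns 2 and 3 are the two groups.
result <- data.frame(
  count = as.integer(m[, 2]),
  form  = m[, 3],
  stringsAsFactors = FALSE
)
result
```

On the five test strings this should give the counts 100, 100, 10, 100, 120 with the forms TABS, TABLET, TAB, TAB, CAPSULES. Rows where nothing matches would come back as NA.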

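Is this too simple for your data set?

str_extract(testdescript, "(?<=[[:space:]]|x|X)[0-9]+[[:space:]]?(TABLET|TABS|TAB|CAPSULES)")

The lookbehind only accepts a number that starts right after whitespace, x or X. The longer alternatives are listed first so that TABS and TABLET are not cut short to TAB.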
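To see why the attempts in the question pick up the extra number, and why the order of the alternatives matters, the following small sketch may help. It assumes stringr is installed; the variable names are illustrative only.

```r
library(stringr)

x <- "DT3042020 100 TABS"

# The group "([0-9/]+[[:space:]])+" may repeat, so it also swallows "3042020 ".
repeated_group <- str_extract(x, "([0-9/]+[[:space:]])+TABS")
repeated_group  # "3042020 100 TABS"

# A single run of digits with at most one space keeps only the adjacent number.
single_run <- str_extract(x, "[0-9]+[[:space:]]?TABS")
single_run      # "100 TABS"

# Alternatives are tried from left to right, so a shorter prefix listed first wins.
short_first <- str_extract("100 TABLET", "[0-9]+ ?(TAB|TABLET)")
short_first     # "100 TAB"

long_first <- str_extract("100 TABLET", "[0-9]+ ?(TABLET|TAB)")
long_first      # "100 TABLET"
```

So the two things to watch are: do not let the "digits plus space" part repeat, and put the longest form word first in the alternation.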
