How to search for multiple strings and replace them with nothing within a list of strings
I have a column in a data frame like this:
npt2$name
# [1] "Andreas Groll, M.D."
# [2] ""
# [3] "Pan-Chyr Yang, PHD"
# [4] "Suh-Fang Jeng, Sc.D"
# [5] "Mostafa K Mohamed Fontanet Arnaud"
# [6] "Thomas Jozefiak, M.D."
# [7] "Medical Monitor"
# [8] "Qi Zhu, MD"
# [9] "Holly Posner"
# [10] "Peter S Sebel, MB BS, PhD Chantal Kerssens, PhD"
# [11] "Lance A Mynderse, M.D."
# [12] "Lawrence Currie, MD"
I tried gsub but with no luck. After doing toupper(x), I need to replace all instances of 'MD', 'M.D.' or 'PHD' with nothing.
Is there a nice short trick to do it?
In fact, I would be interested to see it done on a single string, and to see how differently it is done in one command on the whole list.
Either of these:
gsub("MD|M\\.D\\.|PHD", "", test) # target specific strings
gsub("\\,.+$", "", test) # target all characters after comma
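On the single-string versus whole-list part of the question: gsub is vectorized over its input, so the same call handles both cases. A minimal sketch, assuming upper-cased input as described in the question; the names pattern, one and many are illustrative and not from the original post:

```r
# gsub() is vectorized over its third argument, so one string and a
# whole character vector are handled by exactly the same call.
pattern <- "MD|M\\.D\\.|PHD"

one  <- "QI ZHU, MD"                       # a single upper-cased string
many <- toupper(c("Andreas Groll, M.D.",   # several strings at once
                  "Pan-Chyr Yang, PHD",
                  "Holly Posner"))

one_clean  <- gsub(pattern, "", one)
many_clean <- gsub(pattern, "", many)

print(one_clean)
print(many_clean)
```

Note that these patterns leave the trailing ", " in place. Applied to the data frame column, the same idea would presumably look like npt2$name <- gsub(pattern, "", toupper(npt2$name)).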
Both Matt Parker above and Tommy below have raised the question of whether 'MRCP', 'PhD', 'D.Phil.' and 'Ph.D.' or other British or Continental designations of doctorate-level degrees should be sought out and removed. Perhaps @user56 can advise what the intent was.
With a single ugly regex:
gsub('[MP].?D.?','',npt2$name)
Which says: find the character M or P, followed by zero or one character of any kind, followed by a D and zero or one additional character. More explicitly, you could do this in three steps:
npt2$name <- gsub('MD','',npt2$name)        # plain "MD"
npt2$name <- gsub('M\\.D\\.','',npt2$name)  # "M.D." with literal periods
npt2$name <- gsub('PhD','',npt2$name)       # "PhD" (use 'PHD' after toupper())
In those three, what's happening should be more straightforward. In the second replacement you need to "escape" the periods, since "." is a special character in a regular expression.
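To make the difference between the loose one-liner and the explicit patterns concrete, here is a small sketch. The names in x are hypothetical, chosen only to expose false matches, and the loose pattern is written with the character class [MP]. The word-boundary pattern is one possible tightening, not something proposed in the thread:

```r
# Hypothetical upper-cased names, chosen to expose false matches.
x <- c("MADDEN, MD", "PADILLA, PHD", "MEDIA GROUP, M.D.")

# Loose one-liner: "." matches any character, so parts of ordinary
# words such as "MADD" or "PADI" are removed along with the titles.
loose <- gsub("[MP].?D.?", "", x)

# Escaped periods plus word boundaries (\\b) restrict the match to
# the titles themselves. perl = TRUE selects the PCRE engine.
strict <- gsub("\\b(MD\\b|M\\.D\\.|PHD\\b)", "", x, perl = TRUE)

# fixed = TRUE treats the pattern as a literal string, so the periods
# need no escaping; only the exact text "M.D." is removed.
literal <- gsub("M.D.", "", x, fixed = TRUE)

print(loose)    # "EN, "       "LLA, "       "A GROUP, "
print(strict)   # "MADDEN, "   "PADILLA, "   "MEDIA GROUP, "
print(literal)  # "MADDEN, MD" "PADILLA, PHD" "MEDIA GROUP, "
```

The trade-off is the usual one: the loose pattern is short but damages surnames, while the explicit alternation is longer but only touches the strings it lists.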
Here's a variant that removes the extra ", " too. It does not require toupper either, but if you want case-insensitive matching, just specify ignore.case=TRUE to gsub.
test <- c("Andreas Groll, M.D.",
"",
"Pan-Chyr Yang, PHD",
"Suh-Fang Jeng, Sc.D",
"Peter S Sebel, MB BS, PhD Chantal Kerssens, PhD",
"Lawrence Currie, MD")
gsub(",? *(MD|M\\.D\\.|P[hH]D)", "", test)
#[1] "Andreas Groll"                         ""
#[3] "Pan-Chyr Yang"                         "Suh-Fang Jeng, Sc.D"
#[5] "Peter S Sebel, MB BS Chantal Kerssens" "Lawrence Currie"
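The ignore.case=TRUE option mentioned above can be sketched as follows. The vector mixed is illustrative, and its lower-case "md" entry is an invented example that does not appear in the original data:

```r
# Illustrative input with mixed-case titles; "Qi Zhu, md" is made up
# to show a lower-case variant.
mixed <- c("Andreas Groll, M.D.",
           "Pan-Chyr Yang, PHD",
           "Peter S Sebel, MB BS, PhD Chantal Kerssens, PhD",
           "Qi Zhu, md")

# With ignore.case = TRUE one spelling per title is enough, and no
# toupper() call is needed, so the names keep their original case.
cleaned <- gsub(",? *(MD|M\\.D\\.|PHD)", "", mixed, ignore.case = TRUE)

print(cleaned)
```

One caveat, stated as an assumption about real data rather than something seen in this sample: case-insensitive matching makes it more likely that "md" or "phd" is found inside an ordinary name, so adding word boundaries may be worth considering for larger data sets.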