
How to remove a pattern from rows in a dataframe in R?

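My data rows contain institutions that often have an email address at the end. I want to remove only the email address and keep the institution (e.g. remove hello@canada).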

df <- data.frame(institute = c(
"Air Quality Processes Research Section, Environment and Climate Change Canada, Toronto, Ontario, M3H 5T4, Canada",
"Air Quality Processes Research Section, Environment and Climate Change Canada, Toronto, Ontario, M3H 5T4, Canada. Electronic address: hello@canada",
"Aix-Marseille Universit.., Inserm, TAGC UMR S1090, 13288 Marseille, France. name@inserm",
"Applied Biological Sciences Program, Chulabhorn Graduate Institute, Bangkok, Thailand Laboratory of Biochemistry, Chulabhorn Research Institute, Bangkok, Thailand",
"Applied Biological Sciences Program, Chulabhorn Graduate Institute, Bangkok, Thailand Laboratory of Biochemistry, Chulabhorn Research Institute, Bangkok, Thailand emailX@yahoo.com"))

My goal is to count identical institutions as one institution. In the format above, the email addresses make otherwise identical rows differ.

I tried the code below on the first institute with an email address, but it did not remove the complete email address.

a <- "Air Quality Processes Research Section, Environment and Climate Change Canada, Toronto, Ontario, M3H 5T4, Canada. Electronic address: hello@canada"
gsub("[^.*?]@.*", "\\1", a)
# [1] "Air Quality Processes Research Section, Environment and Climate Change Canada, Toronto, Ontario, M3H 5T4, Canada. Electronic address: hell"

You can use something like this:

df$clean_institute <- trimws(gsub('\\w+@.*$|Electronic address:|email address:', 
                                  '', df$institute))

This removes the run of word characters immediately before the '@' together with everything after it. It also removes labels such as 'Electronic address:' and 'email address:'.
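As a quick check, and to show why the attempt in the question stopped at "hell", both patterns can be run on a single string. This is a minimal sketch; the variable names `attempt` and `fixed` are illustrative and not from the original post.

```r
a <- "Air Quality Processes Research Section, Environment and Climate Change Canada, Toronto, Ontario, M3H 5T4, Canada. Electronic address: hello@canada"

# In the original attempt, [^.*?] is a character class: it matches exactly ONE
# character that is not '.', '*' or '?', so only the "o" before '@' is consumed.
# "\\1" refers to a capture group that does not exist, so nothing is inserted.
attempt <- gsub("[^.*?]@.*", "\\1", a)

# \\w+ matches the whole run of word characters before '@' instead.
fixed <- trimws(gsub("\\w+@.*$|Electronic address:|email address:", "", a))

endsWith(attempt, "hell")    # TRUE
endsWith(fixed, "Canada.")   # TRUE: the trailing period is still there
```

Note that `\\w` does not match '.' or '-', so an address such as first.last@example.org would leave "first." behind. If that occurs in your data, a pattern like `\\S+@\\S+` may be a better fit, assuming addresses never contain spaces.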

Then use table to count the occurrences:

table(df$clean_institute)
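With the sample data, some cleaned strings may still differ by a trailing period ("...Canada" versus "...Canada."), in which case `table` counts them separately. One possible extra step, not part of the answer above, is to strip trailing whitespace and punctuation before counting. The sketch below uses shortened stand-ins for the rows in the question, and the pattern `[[:space:][:punct:]]+$` is an assumption about what counts as trailing noise.

```r
# Shortened, illustrative versions of the rows in the question
df <- data.frame(institute = c(
  "Environment and Climate Change Canada, Toronto, Ontario, M3H 5T4, Canada",
  "Environment and Climate Change Canada, Toronto, Ontario, M3H 5T4, Canada. Electronic address: hello@canada",
  "Inserm, TAGC UMR S1090, 13288 Marseille, France. name@inserm"
))

# Step 1: remove the email address and the label that precedes it
clean <- gsub("\\w+@.*$|Electronic address:|email address:", "", df$institute)

# Step 2: drop whitespace and punctuation left at the end of the string
df$clean_institute <- sub("[[:space:][:punct:]]+$", "", clean)

# Rows 1 and 2 now collapse into a single institution
counts <- table(df$clean_institute)
counts
```

This also strips legitimate trailing punctuation, such as a closing parenthesis, so it is worth inspecting `unique(df$clean_institute)` before relying on the counts.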
