Extract e-mail address from string using R

These are 5 Twitter user descriptions. The idea is to extract the e-mail address from each string.

This is the code I've tried. It works, but there is probably something better. I'd rather avoid using unlist() and do it in one go using regex. I've seen other questions of this kind for Python/Perl/PHP but not for R. I know I could use grep(..., perl = TRUE), but that shouldn't be the only way to do it. If it works, of course it helps.

ds <- c(
  "#MillonMusical | #PromotorMusical | #Diseñador | Contacto : \t\tezequielife@gmail.com | #Instagram : Ezeqielgram | 01-11-11 |           @_MillonMusical @flowfestar",
  "LipGLosSTudio by: SAndry RUbio           Maquilladora PRofesional estudiande de diseño profesional de maquillaje     artistico lipglosstudio@hotmail.com/",
  "Medico General Barranquillero   radicado con su familia en Buenos Aires para iniciar Especialidad       Medico Quirurgica. email jaenpavi@hotmail.com",
  "msn = rdt031169@hotmail.comskype = ronaldotorres-br",
  "Aguante piscis /\t\tmanuarias17@gmail.com  buenos aires"
)

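# Split every description on single spaces into one flat vector of tokens
ds <- unlist(strsplit(ds, ' '))
# Keep only the tokens containing "mail" followed by any character
ds <- ds[grep("mail.", ds)]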

> print(ds)
[1] "\t\tezequielife@gmail.com"  "lipglosstudio@hotmail.com/"
[3] "jaenpavi@hotmail.com"       "rdt031169@hotmail.comskype"
[5] "/\t\tmanuarias17@gmail.com"

It would be nice to separate this one, "rdt031169@hotmail.comskype", perhaps by requiring the address to end in .com or .com.ar, which would make sense for what I'm working on.

Here's one alternative:

> regmatches(ds, regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com", ds))
[1] "ezequielife@gmail.com"     "lipglosstudio@hotmail.com" "jaenpavi@hotmail.com"      "rdt031169@hotmail.com"    
[5] "manuarias17@gmail.com" 
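
One thing to keep in mind with this approach: regexpr() reports only the first match in each string, and regmatches() drops elements that have no match, so the result can be shorter than the input. Below is a minimal sketch of the difference, using a made-up vector `x` (not part of the question's data) and gregexpr(), which returns every match as a list aligned with the input:

```r
# Hypothetical input: two addresses in one string, none in another.
x <- c("a1@foo.com or b2@bar.com", "no address here", "c3@baz.com")
pat <- "[[:alnum:]]+@[[:alpha:]]+\\.com"

# regexpr(): first match per string; strings without a match are dropped.
first_only <- regmatches(x, regexpr(pat, x))

# gregexpr(): all matches, as a list with one element per input string.
all_matches <- regmatches(x, gregexpr(pat, x))

print(first_only)
print(all_matches)
```

If every description is known to hold exactly one address, as in the question, the regexpr() form is enough and avoids the list.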

Based on @Frank's comment, if you want to keep a country identifier after .com, as in your .com.ar example, try this:

> ds <- c(ds, "fulanito13@somemail.com.ar")  # a new e-mail address
> regmatches(ds, regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com(\\.[a-z]{2})?", ds))
[1] "ezequielife@gmail.com"      "lipglosstudio@hotmail.com"  "jaenpavi@hotmail.com"       "rdt031169@hotmail.com"     
[5] "manuarias17@gmail.com"      "fulanito13@somemail.com.ar"
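
The patterns above allow only letters and digits before the @, so an address such as first.last@gmail.com would come back as last@gmail.com. Here is a rough sketch of a variation that also accepts `.`, `_` and `-` in the local part, while still requiring the address to end in .com or .com.xx. The helper name `extract_emails` and the third test string are illustrative assumptions, not part of the original answer:

```r
# Hypothetical helper: returns the first .com / .com.xx address found in
# each element of x (elements without a match are dropped).
extract_emails <- function(x) {
  pat <- "[[:alnum:]._-]+@[[:alnum:]-]+\\.com(\\.[a-z]{2})?"
  regmatches(x, regexpr(pat, x))
}

# Two strings taken from the question plus one made-up address.
x <- c("email jaenpavi@hotmail.com",
       "msn = rdt031169@hotmail.comskype = ronaldotorres-br",
       "first.last@somemail.com.ar")

res <- extract_emails(x)
print(res)
```

This is still far from a full e-mail validator; it only widens the local part and keeps the .com restriction the question asked for.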
