
R Data Wrangling for Emails

Need help! This is a work-related project. I need to clean 16,000 emails, and I am expected to do it by hand :( I need to find a way to pull the domain name from each email into a new column and parse the name into a new column as well, while still keeping the original email. The data is partially complete.

library(tidyr)
library(magrittr)

Email.Address <- c('john.doe@abccorp.com','jdoe@cisco.com','johnd@widgetco.com')
First.Name <- c('John', 'JDoe','NA' )
Last.Name <- c('Doe','NA','NA')
Company <- c('NA','NA','NA')

data <- data.frame(Email.Address, First.Name, Last.Name, Company)
separate_DF <- data %>% separate(Email.Address, c("Company"), sep="@")

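Try this:

# Keep strings as character vectors so the "NA" placeholders can be compared and replaced
df <- data.frame(Email.Address, First.Name, Last.Name, Company, stringsAsFactors = FALSE)
# Split each address at "@" once, then reuse the two halves
parts <- strsplit(df$Email.Address, "@")
local.part <- sapply(parts, "[[", 1)
domain <- sapply(parts, "[[", 2)
Corp <- sapply(strsplit(domain, "[.]"), "[[", 1)
F.Name <- sapply(strsplit(local.part, "[.]"), "[[", 1)
L.Name <- sapply(strsplit(local.part, "[.]"), tail, n = 1)
# No "." in the local part means there is no separate last name
L.Name[L.Name == F.Name] <- NA
OUT <- data.frame(df$Email.Address, F.Name, L.Name, Corp, stringsAsFactors = FALSE)
# Fill only the cells that are missing ("NA" strings or real NA values)
gaps <- df == "NA" | is.na(df)
df[gaps] <- OUT[gaps]
df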

The separate() function from tidyr looks good too:

http://blog.rstudio.org/2014/07/22/introducing-tidyr/

From the information you have given, this also works:

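library(tidyr)

df  <-  data.frame(Email.Address, First.Name, Last.Name, Company)
# remove = FALSE keeps the original Email.Address column alongside the new ones
df2 <-  separate(df, Email.Address, into = c("Name", "Corp"), sep = "@", remove = FALSE)
# fill = "right" sets L.Name to NA, without a warning, when the name has no "."
df2 <-  separate(df2, Name, into = c("F.Name", "L.Name"), sep = "[.]", extra = "drop", fill = "right")
df2 <-  separate(df2, Corp, into = c("Corp", "com"), sep = "[.]", extra = "drop")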
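
Note that separate() creates new columns rather than filling in the partially complete ones. If the goal is to keep the values that are already there and only fill the gaps, something along these lines might work in base R. This is only a sketch: it assumes the literal string "NA" marks a missing value (as in the question's sample data), and fill_missing is a helper name I made up for illustration, not a function from any package.

```r
Email.Address <- c("john.doe@abccorp.com", "jdoe@cisco.com", "johnd@widgetco.com")
First.Name <- c("John", "JDoe", "NA")
Last.Name <- c("Doe", "NA", "NA")
Company <- c("NA", "NA", "NA")
df <- data.frame(Email.Address, First.Name, Last.Name, Company,
                 stringsAsFactors = FALSE)

# Assumption: the literal string "NA" means "missing", so turn it into a real NA
df[!is.na(df) & df == "NA"] <- NA

# Part before "@" and part after "@"
local.part <- sub("@.*$", "", df$Email.Address)
domain <- sub("^.*@", "", df$Email.Address)

# Hypothetical helper: take `new` only where `old` is missing
fill_missing <- function(old, new) ifelse(is.na(old), new, old)

# A last name can only be guessed when the local part contains a "."
has.dot <- grepl(".", local.part, fixed = TRUE)
first.guess <- sub("[.].*$", "", local.part)
last.guess <- ifelse(has.dot, sub("^.*[.]", "", local.part), NA)
company.guess <- sub("[.].*$", "", domain)

df$First.Name <- fill_missing(df$First.Name, first.guess)
df$Last.Name <- fill_missing(df$Last.Name, last.guess)
df$Company <- fill_missing(df$Company, company.guess)
df
```

Email.Address is left untouched, existing names such as "John" and "Doe" are kept, and only the missing cells are filled from the address. Values guessed from the address stay in lower case (for example "johnd"), so they may still need a manual check.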

