Assign data.table column from function with column inputs
I have a data.table with a few columns that I am using as inputs to a phone validation function that I have created.
library(data.table)
dt <- data.table(ID = c(1:6),
                 phone = c("0412 345 789","0438 123 456",
                           "041 2345 543", "(02) 1234 5678",
                           "9876 1234", "04123456789"),
                 state = c("NSW","QLD","SA"),
                 country = c("AU"),
                 phone_countries = c("AU","AU","AU","AU,US","AU,US","AU,US"))
#    ID          phone state country phone_countries
# 1:  1   0412 345 789   NSW      AU              AU
# 2:  2   0438 123 456   QLD      AU              AU
# 3:  3   041 2345 543    SA      AU              AU
# 4:  4 (02) 1234 5678   NSW      AU           AU,US
# 5:  5      9876 1234   QLD      AU           AU,US
# 6:  6    04123456789    SA      AU           AU,US
The function isValidPhone looks like this. It is designed to validate phone numbers in a few different locations; I have omitted some of the regexes for brevity.
isValidPhone <- function(phone, state, country, validation_countries) {
    if (!(country %in% unlist(strsplit(validation_countries, ","))))
        return(FALSE)
    # remove whitespace, hyphens, dots and brackets
    phone_clean <- gsub("[[:space:]]|-|\\.|\\(|\\)", "", phone)
    if (is.na(phone_clean) | phone_clean == '' | is.na(iconv(phone_clean, "", "ASCII")))
        return(FALSE)
    if (country == "AU") {
        # prepend state area code if length is 8 digits
        #print(paste("phone:", phone_clean, "state:", state))
        if (nchar(phone_clean, "width") == 8)
            if (state %in% c('ACT', 'NSW', 'QLD', 'VIC', 'TAS', 'SA', 'NT', 'WA'))
                phone_clean <- switch(state,
                                      'ACT' = paste0("02", phone_clean),
                                      'NSW' = paste0("02", phone_clean),
                                      'QLD' = paste0("07", phone_clean),
                                      'VIC' = paste0("03", phone_clean),
                                      'TAS' = paste0("03", phone_clean),
                                      'SA' = paste0("08", phone_clean),
                                      'NT' = paste0("08", phone_clean),
                                      'WA' = paste0("08", phone_clean))
        if (nchar(phone_clean, "width") == 9)
            if (substr(phone_clean, 1, 1) %in% c(2:4, 7, 8))
                phone_clean <- paste0("0", phone_clean)
        return(grepl("^(?:\\+?61|0)[23478](?:[ -]?[0-9]){8}$",
                     as.character(phone_clean), ignore.case = TRUE))
    }
}
I am assigning a field called validphone in my data.table dt:
dt[, validphone := isValidPhone(phone, state, country, phone_countries), by = 1:nrow(dt)]
#    ID          phone state country phone_countries validphone
# 1:  1   0412 345 789   NSW      AU              AU       TRUE
# 2:  2   0438 123 456   QLD      AU              AU       TRUE
# 3:  3   041 2345 543    SA      AU              AU       TRUE
# 4:  4 (02) 1234 5678   NSW      AU           AU,US       TRUE
# 5:  5      9876 1234   QLD      AU           AU,US       TRUE
# 6:  6    04123456789    SA      AU           AU,US      FALSE
Unfortunately, I have to use by = 1:nrow(dt) with the function in its current form: without it, the full columns are passed in as the arguments, which causes problems. This leads to a LOT of function calls on my real data set (~300K) and poor performance.
I have read that it would be better to use a vectorised function, but it is unclear to me how to do this.
Is there a more efficient way to achieve my desired outcome?
Some re-engineering is needed before your function can be used on vectors.
Mainly, replace each if(...) return(FALSE) with an assignment of FALSE to the filtered rows, and evaluate these checks in reverse order: in the scalar version the first return has the last word, whereas in the vectorised version the last assignment has the last word.
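To make the reordering concrete, here is a minimal toy sketch, unrelated to phone numbers. The names check_scalar and check_vector are hypothetical and exist only for this illustration; the "valid" rule (exactly three characters) is an arbitrary assumption.

```r
# Scalar version: the first check that matches returns, so it wins.
check_scalar <- function(x) {
  if (is.na(x)) return(FALSE)
  if (x == "") return(FALSE)
  nchar(x) == 3
}

# Vectorised version: compute the general answer for every element first,
# then overwrite the elements caught by the early-exit checks. Because the
# overrides are assigned last, they have the last word.
check_vector <- function(x) {
  ans <- nchar(x) == 3
  ans[is.na(x) | x == ""] <- FALSE
  ans
}

x <- c("abc", "", NA, "abcd")
scalar_result <- vapply(x, check_scalar, logical(1), USE.NAMES = FALSE)
vector_result <- check_vector(x)
identical(scalar_result, vector_result)
```

Both versions should agree on every element, but check_vector is called once for the whole vector instead of once per element.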
The switch also needs to be replaced by ifelse.
You end up with something like this, which can be called on whole columns without by = 1:nrow(dt):
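isValidPhone <- function(phone, state, country, validation_countries) {
    # remove whitespace, hyphens, dots and brackets
    phone_clean <- gsub("[[:space:]]|-|\\.|\\(|\\)", "", phone)
    # AU rows with 8 digits and a known state get an area code; which() drops NA rows
    AddArea <- which(country == "AU" & nchar(phone_clean) == 8 &
                     state %in% c('ACT', 'NSW', 'QLD', 'VIC', 'TAS', 'SA', 'NT', 'WA'))
    phone_clean[AddArea] <- ifelse(state[AddArea] %in% c('ACT', 'NSW'),
        paste0("02", phone_clean[AddArea]),
        ifelse(state[AddArea] %in% c('VIC', 'TAS'),
            paste0("03", phone_clean[AddArea]),
            ifelse(state[AddArea] %in% c('SA', 'NT', 'WA'),
                paste0("08", phone_clean[AddArea]),
                paste0("07", phone_clean[AddArea]))))  # the only state left is QLD
    # AU rows with 9 digits get a leading zero, as in the scalar version
    AddZero <- which(country == "AU" & nchar(phone_clean) == 9 &
                     substr(phone_clean, 1, 1) %in% c(2:4, 7, 8))
    phone_clean[AddZero] <- paste0("0", phone_clean[AddZero])
    ans <- grepl("^(?:\\+?61|0)[23478](?:[ -]?[0-9]){8}$",
                 as.character(phone_clean), ignore.case = TRUE)
    # compare each row's country with that row's own list of validation countries
    country_ok <- mapply(function(cn, vc) cn %in% vc, country,
                         strsplit(validation_countries, ",", fixed = TRUE),
                         USE.NAMES = FALSE)
    # the early-exit checks are assigned last so that they override the regex result
    ans[!country_ok | is.na(phone_clean) | phone_clean == '' |
        is.na(iconv(phone_clean, "", "ASCII"))] <- FALSE
    return(ans)
}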
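As an alternative to the nested ifelse calls, the state-to-area-code mapping could also be written as a named lookup vector. This is only a sketch: area_codes and add_area_code are hypothetical names that do not appear in the code above, and the mapping simply mirrors the switch in the question.

```r
# Hypothetical lookup table: state abbreviation -> area code prefix.
area_codes <- c(ACT = "02", NSW = "02", QLD = "07", VIC = "03",
                TAS = "03", SA = "08", NT = "08", WA = "08")

# Hypothetical helper: prepend the area code to 8-digit numbers whose
# state is in the lookup table; every other element is left unchanged.
add_area_code <- function(phone_clean, state) {
  # unknown or missing states yield NA prefixes
  prefix <- unname(area_codes[as.character(state)])
  # which() drops elements where the condition is NA (e.g. a missing phone)
  needs_code <- which(nchar(phone_clean) == 8 & !is.na(prefix))
  phone_clean[needs_code] <- paste0(prefix[needs_code], phone_clean[needs_code])
  phone_clean
}

result <- add_area_code(c("98761234", "0412345789", "12345678"),
                        c("QLD", "NSW", "XX"))
result
# [1] "0798761234" "0412345789" "12345678"
```

Indexing by name keeps the mapping in one place, so adding or correcting a state only touches the lookup vector.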