regex Clean Hourly Wage column fast in data.table R
I am trying to clean up an unstructured data column. I just want to extract the numeric portion of each value, with no dollar sign or anything else before or after the wage number.
Currently I am using a foreach loop, but it is really slow on the actual table of 10,000 rows. In the data table foo, startPay is the original data format and startPayCLEAN is the desired result.
library(data.table)
library(foreach)  # provides foreach() and %do%
foo <- data.table(startPay=c("12.00 hr","$12.02","$8.00 per hour","18.00 ph","10.50 pre hr."))
foo[,id:=seq.int(1,nrow(foo))]
rowCount <- seq.int(1,nrow(foo))
startPay <- foreach (i=rowCount,.combine=c,.packages='data.table') %do% {
  # positions of every digit or dot in row i (-1 if there are none)
  charList <- gregexpr("[.0-9]",foo$startPay[i])[[1]]
  if (charList[1]==-1) {
    NA_character_
  } else {
    # ignore stray characters late in the string, e.g. the trailing dot in "hr."
    charList <- charList[which(charList<8)]
    substr(foo$startPay[i],min(charList),max(charList))
  }
}
foo$startPayCLEAN <- startPay
I think you just need gsub to select the numeric part.
gsub(".*?(\\d+\\.\\d+).*", "\\1", foo$startPay)
[1] "12.00" "12.02" "8.00" "18.00" "10.50"
You may want to convert it to a number.
as.numeric(gsub(".*?(\\d+\\.\\d+).*", "\\1", foo$startPay))
[1] 12.00 12.02 8.00 18.00 10.50
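One thing to watch with the gsub approach: when a string contains no decimal number, gsub returns it unchanged, so as.numeric then yields NA with a coercion warning. A minimal sketch of an alternative that extracts the match directly and writes it to the table with := is shown here. The helper name extract_wage and the extra "negotiable" row are hypothetical, added only for illustration.

```r
library(data.table)

# "negotiable" is a made-up row (not in the original data) with no number in it.
foo <- data.table(startPay = c("12.00 hr", "$12.02", "$8.00 per hour",
                               "18.00 ph", "10.50 pre hr.", "negotiable"))

# Hypothetical helper: first decimal number in each string, or NA if none.
extract_wage <- function(x) {
  out <- rep(NA_character_, length(x))
  m <- regexpr("[0-9]+\\.[0-9]+", x)   # start of first match, -1 if no match
  out[m != -1] <- regmatches(x, m)     # regmatches() returns matched elements only
  as.numeric(out)
}

# Vectorised over the whole column, so no row-by-row loop is needed.
foo[, startPayCLEAN := extract_wage(startPay)]
print(foo)
```

Because regexpr and regmatches work on the whole vector at once, this should scale to 10,000 rows without an explicit loop.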
You should be able to do this with one regex:
library(data.table)
foo <- data.table(startPay=c("12.00 hr","$12.02","$8.00 per hour","18.00 ph","10.50 pre hr."))
foo[, startPayCLEAN := gsub("(^\\.|[^0-9.]|\\.$)", replacement = "", startPay)]
Here the regex can be split into three alternatives, separated by pipes:
^\\. - a dot at the start of the string
[^0-9.] - any character that is not a digit or a dot
\\.$ - a dot at the end of the string
gsub finds every match in startPay and replaces it with an empty string.
In a regex, the pipe means OR: (a|b) will match either a or b.
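To see why the start and end alternatives are needed, it may help to run the middle alternative on its own first. The sketch below uses base R only; the variable names step1 and cleaned are just for illustration.

```r
x <- c("12.00 hr", "$12.02", "$8.00 per hour", "18.00 ph", "10.50 pre hr.")

# Middle alternative only: drop everything that is not a digit or a dot.
# The last element keeps the dot from "hr." and becomes "10.50."
step1 <- gsub("[^0-9.]", "", x)

# Full pattern: additionally drop a dot at the very start or very end.
cleaned <- gsub("(^\\.|[^0-9.]|\\.$)", "", x)

print(step1)
print(cleaned)
```

Note that ^ and $ refer to the original string, not to the partially cleaned one, so a dot in the middle of the text (for example after an abbreviation) would not be removed by this pattern.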