Regex: clean an hourly wage column fast in data.table (R)

I am trying to clean up an unstructured data column. I just want to extract the numeric portion of each value, with no dollar symbol or anything else before or after the wage number.

Currently, I am using a foreach loop, but it is really slow on the actual table of 10,000 rows. In the data table foo, startPay is the original data format and startPayCLEAN is the desired result.

library(data.table)
library(foreach)  # provides foreach() and %do%

foo <- data.table(startPay=c("12.00 hr","$12.02","$8.00 per hour","18.00 ph","10.50 pre hr."))
foo[,id:=seq.int(1,nrow(foo))]
rowCount <- seq.int(1,nrow(foo))
# .combine=c collects the per-row results into a plain character vector
startPay <- foreach (i=rowCount,.combine=c) %do% {
  # positions of every digit or dot in row i (-1 if there are none)
  charList <- unlist(gregexpr("[.0-9]",foo$startPay[i]))
  if (charList[1]==-1) {
    NA_character_ } else {
      # keep only positions within the first 7 characters
      charList <- charList[which(charList<8)]
      substr(foo$startPay[i],min(charList),max(charList))
    }
}

foo$startPayCLEAN <- startPay

I think that you just need to use gsub to select the numeric part.

gsub(".*?(\\d+\\.\\d+).*", "\\1", foo$startPay)
[1] "12.00" "12.02" "8.00"  "18.00" "10.50"

You may want to convert it to a number.

as.numeric(gsub(".*?(\\d+\\.\\d+).*", "\\1", foo$startPay))
[1] 12.00 12.02  8.00 18.00 10.50
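
One thing to watch: the pattern above requires a decimal point, so a value such as "15 hr" would not match and gsub would return it unchanged. A minimal sketch of a variant that also accepts whole-dollar wages and returns NA when there is no number at all is below. The helper name extract_wage and the extra sample values are assumptions made for illustration, not part of the original question.

```r
# Hypothetical helper: pull the first number (with or without decimals)
# out of each string and return it as numeric, NA where nothing matches.
extract_wage <- function(x) {
  out <- rep(NA_real_, length(x))
  # regexpr() gives the start of the first match per element, -1 if none
  hit <- regexpr("[0-9]+(\\.[0-9]+)?", x)
  ok <- !is.na(hit) & hit != -1
  # regmatches() returns only the matched substrings, in order
  out[ok] <- as.numeric(regmatches(x, hit))
  out
}

wages <- c("12.00 hr", "$12.02", "$8.00 per hour", "15 hr", "negotiable")
result <- extract_wage(wages)
print(result)
```

With data.table this could be applied as foo[, startPayCLEAN := extract_wage(startPay)], which stays vectorised and avoids the row-by-row loop.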

You should be able to do this with one regex:

library(data.table)

foo <- data.table(startPay=c("12.00 hr","$12.02","$8.00 per hour","18.00 ph","10.50 pre hr."))
foo[, startPayCLEAN := gsub("(^\\.|[^0-9.]|\\.$)", replacement = "", startPay)]

Here the regex can be split into three parts (separated by pipes):

  • ^\\. - a dot at the start of the string
  • [^0-9.] - any character that is not a digit or a dot
  • \\.$ - a dot at the end of the string

gsub finds matching characters in startPay and replaces them with an empty string.

In a regex, the pipe means OR: (a|b) will match either a or b.
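
To make the three alternatives concrete, here is a small sketch in base R that runs each one on its own and then the combined pattern. The single-string inputs ".50" and "10.50." are made-up examples chosen to trigger the first and third alternatives; only the vector x comes from the question.

```r
# Each alternative of the pattern, tried separately
lead_dot  <- gsub("^\\.", "", ".50")                # drops a leading dot
non_num   <- gsub("[^0-9.]", "", "$8.00 per hour")  # drops everything but digits and dots
trail_dot <- gsub("\\.$", "", "10.50.")             # drops a trailing dot

# The combined pattern applied to the sample data from the question
x <- c("12.00 hr", "$12.02", "$8.00 per hour", "18.00 ph", "10.50 pre hr.")
cleaned <- gsub("(^\\.|[^0-9.]|\\.$)", "", x)
print(cleaned)
```

Because gsub tests every alternative at each position of the original string, the final dot in "10.50 pre hr." is removed by the third alternative in the same pass that removes the letters.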

