Adjust factors in a dataset with a variable number of leading zeros

I have a big data.frame (1.9M records, with 20 columns). One of the columns is a factor column whose values are strings of digits of different lengths (e.g. 567839, 234324324, 3243211, etc.). Note: these are numeric codes, not real values; for this example they could just as well be character strings of different lengths.

Now I want to convert those factors into 13-digit factors, so that a factor gets leading zeros whenever its number of digits is less than 13.

Example:

Old factor      Length  New factor
432543532532    12      0432543532532
3285087250932   13      3285087250932
464577534       9       0000464577534
2225324324324   13      2225324324324
864235325264    12      0864235325264

I tried different approaches, but now I'm stuck. The problem is that the length of the factor differs throughout the dataset.

I tried the following, with an example.

Create data.frame with three different columns on which I perform my code, to identify the problem.

> df.test <- as.data.frame(cbind(c("432543532532", "3285087250932", "464577534", "2225324324324", "864235325264"), c("3285087250932", "132543532532", "464577534", "2225324324324", "864235325264"), c("164577534", "3285087250932", "432543532532", "2225324324324", "864235325264")), stringsAsFactors = TRUE)
> df.test
             V1            V2            V3
1  432543532532 3285087250932     164577534
2 3285087250932  132543532532 3285087250932
3     464577534     464577534  432543532532
4 2225324324324 2225324324324 2225324324324
5  864235325264  864235325264  864235325264

> levels(df.test$V1) <- paste(substr("0000000000000", 0, 13 - nchar(as.character(levels(df.test$V1)))), levels(df.test$V1), sep = '')
> levels(df.test$V2) <- paste(substr("0000000000000", 0, 13 - nchar(as.character(levels(df.test$V2)))), levels(df.test$V2), sep = '')
> levels(df.test$V3) <- paste(substr("0000000000000", 0, 13 - nchar(as.character(levels(df.test$V3)))), levels(df.test$V3), sep = '')
> df.test
             V1             V2                V3
1  432543532532 03285087250932     0000164577534
2 3285087250932  0132543532532 00003285087250932
3     464577534     0464577534  0000432543532532
4 2225324324324 02225324324324 00002225324324324
5  864235325264  0864235325264  0000864235325264

The problem is that the code nchar(as.character(levels(df.test$V1))) does not seem to use the lengths of all the levels of df.test$V1 but just one value: the length of the first level of the factor (the levels are sorted alphabetically/ascending). The number of leading zeros needed for that first level is then applied to all records. So the code is not vectorised!

Note: if I run the 'nchar' code separately, it gives me a vector with the lengths of all the levels, so I assumed it should work...

> nchar(as.character(levels(df.test$V1)))
[1] 13 13 12  9 12
> nchar(as.character(levels(df.test$V2)))
[1] 13 14 14 10 13
> nchar(as.character(levels(df.test$V3)))
[1] 13 17 17 16 16

Why isn't nchar(as.character(levels(df.test$V1))) running as a vector operator? Can anybody tell me how to change my code, so it will have the correct result?

Thanks in advance!

NB. Note that in the real case I only need to perform this adjustment on one column of the data.frame.

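For zero padding you can use sprintf('%04d', 1:5), but the codes in your example first need to be converted to numeric. They are too large for R's integer type, so use a '.0f' conversion rather than 'd':

max.nchar <- max(nchar(levels(df.test$V1)))

sprintf(paste0('%0', max.nchar, '.0f'), as.numeric(levels(df.test$V1))[df.test$V1])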
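Since the question only needs one column adjusted and wants the result to stay a factor, the numeric sprintf idea can also be applied to the levels directly. This is a minimal sketch under assumptions: the data frame df and its column code are made-up names, and the width of 13 is taken from the question. The codes exceed R's 32-bit integer range, so the sketch formats them as doubles with '%013.0f' instead of '%d'.

```r
# Hypothetical one-column example; 'df' and 'code' are illustrative names
df <- data.frame(code = factor(c("432543532532", "464577534")))

# Pad each level once; every row that uses a level is updated automatically
levels(df$code) <- sprintf("%013.0f", as.numeric(levels(df$code)))

as.character(df$code)
```

Working on the levels rather than on the rows means the padding is computed once per distinct code, which may matter on 1.9M records.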

Maybe there is a better way, but you can use gsub with sprintf:

gsub(' ', '0', sprintf('%4s', levels(factor(10:15))))

# Note: the 0 flag with %s zero-pads on some platforms and is ignored on others
as.data.frame(lapply(df.test, sprintf, fmt = "%013s"))
#---------------------
             V1            V2            V3
1 0432543532532 3285087250932 0000164577534
2 3285087250932 0132543532532 3285087250932
3 0000464577534 0000464577534 0432543532532
4 2225324324324 2225324324324 2225324324324
5 0864235325264 0864235325264 0864235325264
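Because the zero flag with %s is platform-dependent, a variant that builds the padding explicitly may be more portable. The following is a sketch, not part of the original answers: pad_left is a hypothetical helper name, and strrep requires R 3.3.0 or later.

```r
# Hypothetical helper: left-pad strings with zeros up to 'width' characters
pad_left <- function(x, width = 13) {
  x <- as.character(x)
  # Number of zeros each element still needs (never negative)
  n_missing <- pmax(0, width - nchar(x))
  # strrep() is vectorised over 'times', so each element gets its own padding
  paste0(strrep("0", n_missing), x)
}

codes <- factor(c("432543532532", "3285087250932", "464577534"))

# Pad the levels so the column remains a factor
levels(codes) <- pad_left(levels(codes))
as.character(codes)
```

Strings that are already 13 characters or longer are returned unchanged.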

Your code was not working because substr returns "a character vector of the same length and with the same attributes as x (after possible coercion)". Since x was a single string, only one padding string came back, and paste recycled it for every level. So you have to make sure x has as many elements as your expected return value:

# stringsAsFactors = TRUE is needed from R 4.0.0 on, where strings are no longer converted to factors by default
df.test <- as.data.frame(cbind(c("432543532532", "3285087250932", "464577534", "2225324324324", "864235325264"), c("3285087250932", "132543532532", "464577534", "2225324324324", "864235325264"), c("164577534", "3285087250932", "432543532532", "2225324324324", "864235325264")), stringsAsFactors = TRUE)
df.test

for (thevar in colnames(df.test)) {
    lv <- levels(df.test[, thevar])
    # One pad string per level, so that substr returns one result per level
    padStrs <- rep("0000000000000", length(lv))
    # Number of zeros each level is missing
    cdiff <- 13 - nchar(lv)
    levels(df.test[, thevar]) <- paste(substr(padStrs, 1, cdiff), lv, sep = '')
}
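The behaviour of substr described in this answer can be checked on a toy input. This is a small sketch; the names pad, stops, single and repeated are made up for illustration.

```r
pad <- "0000000000000"
stops <- c(0, 1, 4)

# x has length 1, so the result has length 1 and only stops[1] is used
single <- substr(pad, 1, stops)

# x has one element per stop value, so each element is cut at its own stop
repeated <- substr(rep(pad, length(stops)), 1, stops)

length(single)
repeated
```

The single-string call silently ignores all but the first stop value, which is why every level in the question received the same number of zeros.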
