简体   繁体   中英

Overwrite levels of factor columns in one dataframe using another

I have 2 data frames with multiple factor columns. One is the base data frame and the other is the final data frame. I want to update the levels of the base data frame using the final data frame.

Consider this example:

base <- data.frame(product=c("Business Call", "Business Transactional", 
                             "Monthly Non-Compounding and Standard Non-Compounding",
                             "OCR based Call", "Offsale Call", "Offsale Savings",
                             "Offsale Transactional", "Out of Scope","Personal Call"))
base$product <- as.factor(base$product)

final <- data.frame(product=c("Business Call", "Business Transactional", 
                              "Monthly Standard Non-Compounding", "OCR based Call", 
                              "Offsale Call", "Offsale Savings","Offsale Transactional", 
                              "Out of Scope","Personal Call", "You Money")) 
final$product <- as.factor(final$product)

What I would now want is for the final data base to have the same levels as base and remove the levels which do not exist at all like "You Money". Whereas "Monthly Standard Non-Compounding" to be fuzzy matched

Eg:

levels(base$var1) <- "a" "b" "c"
levels(final$var1) <- "Aa" "Bb" "Cc"

Is there a way to overwrite the levels in base data using the final data using some kind of fuzzy match?

Like I want the final levels for both data to be the same. ie

levels(base$var1) <- "Aa" "Bb" "Cc"
levels(final$var1) <- "Aa" "Bb" "Cc"

We could build our own fuzzyMatcher .

First, we'll need kinda vectorized agrep function,

agrepv <- function(x, y) all(as.logical(sapply(x, agrep, y)))

on which we build our fuzzyMatcher .

fuzzyMatcher <-  function(from, to) { 
  mc <- mapply(function(y) 
    which(mapply(function(x) agrepv(y, x), Map(levels, to))), 
    Map(levels, from))
  return(Map(function(x, y) `levels<-`(x, y), base, 
             Map(levels, from)[mc]))
}

final labels applied on base labels ( note, that I've shifted columns to make it a little more sophisticated):

base[] <- fuzzyMatcher(final1, base1)
#    X1 X2
# 1  Aa Xx
# 2  Aa Xx
# 3  Aa Yy
# 4  Aa Yy
# 5  Bb Yy
# 6  Bb Zz
# 7  Bb Zz
# 8  Aa Xx
# 9  Cc Xx
# 10 Cc Zz

Update

Based on the new provided data above it'll make sense to use another vectorized agrepv2() , which, used with outer() , enables us to apply agrep on all combinations of the levels of both vectors. Hereafter colSums that equal zero give us non-matching levels and which.max the matching levels of the target data frame final . We can use these two resulting vectors on the one hand to delete unused rows of final , on the other hand to subset the desired levels of the base data frame in order to rebuild the factor column.

# add to mimic other columns in data frame
base$x <- seq(nrow(base))
final$x <- seq(nrow(final))

# some abbrevations for convenience
p1 <- levels(base$product)
p2 <- levels(final$product)

# agrep
AGREPV2 <- Vectorize(function(x, y, ...) agrep(p2[x], p1[y]))  # new vectorized agrep 
out <- t(outer(seq(p2), seq(p1), agrepv2, max.distance=0.9))  # apply `agrepv2`
del.col <- grep(0, colSums(apply(out, 2, lengths))) # find negative matches
lvl <- unlist(apply(out, 2, which.max))  # find positive matches
lvl <- as.character(p2[lvl])  # get the labels

# delete "non-existing" rows and re-generate factor with new labels
transform(final[-del.col, ], product=factor(product, labels=lvl))
#                  product x
# 1          Business Call 1
# 2 Business Transactional 2
# 4         OCR based Call 4
# 5           Offsale Call 5
# 6        Offsale Savings 6
# 7  Offsale Transactional 7
# 8           Out of Scope 8
# 9          Personal Call 9

Data

base1 <- structure(list(X1 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 
3L, 3L), .Label = c("a", "b", "c"), class = "factor"), X2 = structure(c(1L, 
1L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 3L), .Label = c("x", "y", "z"
), class = "factor")), row.names = c(NA, -10L), class = "data.frame")

final1 <- structure(list(X1 = structure(c(1L, 3L, 1L, 1L, 2L, 3L, 2L, 1L, 
2L, 2L, 3L, 3L, 2L, 2L, 2L), .Label = c("Xx", "Yy", "Zz"), class = "factor"), 
    X2 = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 
    2L, 2L, 2L, 2L, 3L), .Label = c("Aa", "Bb", "Cc"), class = "factor")), row.names = c(NA, 
-15L), class = "data.frame")

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM