
Grouping/recoding factors in the same data.frame

Let's say I have a data frame like this:

df <- data.frame(a = letters[1:26], 1:26, stringsAsFactors = TRUE)

I would like to recode the levels "a", "b", and "c" as a single level "a".

How do I do that?

One option is the recode() function in the car package:

library(car)  # provides recode()
df <- data.frame(a = letters[1:26], 1:26, stringsAsFactors = TRUE)
df2 <- within(df, a <- recode(a, 'c("a","b","c")="a"'))
> head(df2)
  a X1.26
1 a     1
2 a     2
3 a     3
4 d     4
5 e     5
6 f     6

Here is an example where a is less trivial and several levels are recoded into one:

set.seed(123)
df3 <- data.frame(a = sample(letters[1:5], 100, replace = TRUE),
                  b = 1:100, stringsAsFactors = TRUE)
with(df3, head(a))
with(df3, table(a))

The last two lines give:

> with(df3, head(a))
[1] b d c e e a
Levels: a b c d e
> with(df3, table(a))
a
 a  b  c  d  e 
19 20 21 22 18

Now let's combine levels a and e into a new level Z using recode():

df4 <- within(df3, a <- recode(a, 'c("a","e")="Z"'))
with(df4, head(a))
with(df4, table(a))

which gives:

> with(df4, head(a))
[1] b d c Z Z Z
Levels: b c d Z
> with(df4, table(a))
a
 b  c  d  Z 
20 21 22 37

Doing this without spelling out the levels to merge:

## Select the levels you want (here 'a' and 'e')
lev.want <- with(df3, levels(a)[c(1,5)])
## now paste together
lev.want <- paste(lev.want, collapse = "','")
## then bolt on the extra bit
codes <- paste("c('", lev.want, "')='Z'", sep = "")
## then use within recode()
df5 <- within(df3, a <- recode(a, codes))
with(df5, table(a))

Which gives us the same as df4 above:

> with(df5, table(a))
a
 b  c  d  Z 
20 21 22 37 
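
To see what the pasted string looks like before it is passed to recode(), it can help to build it in isolation. The following is only a sketch in base R: make_recode_spec() is a hypothetical helper name (it is not part of car), and it assumes the level names contain no single quotes.

```r
# Hypothetical helper (not part of car): build a recode() specification
# string that maps every level in `from` to the single level `to`.
# Assumes the level names contain no single quotes.
make_recode_spec <- function(from, to) {
  # Quote each level and join them: 'a','e'
  quoted <- paste0("'", from, "'", collapse = ",")
  # Wrap as c(...)='to'
  paste0("c(", quoted, ")='", to, "'")
}

lev <- c("a", "b", "c", "d", "e")
codes <- make_recode_spec(lev[c(1, 5)], "Z")
print(codes)
```

The resulting string has the same form as the codes object built step by step above, so it could be passed to recode() in the same way.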

Has anyone tried using this simple method? It requires no special packages, just an understanding of how R treats factors.

Say you want to rename some levels of a factor. First get the levels:

data <- data.frame(a = letters[1:26], 1:26, stringsAsFactors = TRUE)
lalpha <- levels(data$a)

In this example, suppose we want the indices of the levels 'e' and 'w':

ind <- c(which(lalpha == 'e'), which(lalpha == 'w'))

Now we can use this index to replace the levels of the factor 'a'

levels(data$a)[ind] <- 'X'

If you now look at the factor a in the data frame, there will be an X wherever there was an e or a w.

I leave it to you to try the result.
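
For anyone who wants to check it, here is a minimal self-contained sketch of the steps above. It assumes the column is a factor, which is made explicit with stringsAsFactors = TRUE (needed on R 4.0 and later), and it names the second column n purely for illustration.

```r
# Assumption: `a` must be a factor, so request that explicitly
data <- data.frame(a = letters[1:26], n = 1:26, stringsAsFactors = TRUE)

# Positions of the levels to merge
lalpha <- levels(data$a)
ind <- which(lalpha %in% c("e", "w"))

# Assigning the same label to both positions merges them into one level
levels(data$a)[ind] <- "X"

nlevels(data$a)                  # one level fewer than before
as.character(data$a[c(5, 23)])   # the rows that used to hold "e" and "w"
table(data$a)[["X"]]             # number of rows now labelled "X"
```

Because two positions receive the same label, R collapses them into a single level rather than keeping two levels called X.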

You could do something like:

df$a[df$a %in% c("a","b","c")] <- "a"

UPDATE: More complicated factors.

Data <- data.frame(a=sample(c("Less than $50,000","$50,000-$99,999",
  "$100,000-$249,999", "$250,000-$500,000"),20,TRUE),n=1:20,
  stringsAsFactors = TRUE)
rows <- Data$a %in% c("$50,000-$99,999", "$100,000-$249,999")
Data$a[rows] <- "$250,000-$500,000"
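
One caveat worth illustrating: this kind of subassignment works here because the replacement value is already a level of the factor. The sketch below uses a small made-up factor f (not from the question) to show what happens when the target is a new label, and one possible way around it.

```r
f <- factor(c("a", "b", "c", "d"))

# Target "c" is an existing level: the assignment works as expected
g <- f
g[g %in% c("a", "b")] <- "c"

# Target "Z" is not a level: R warns and stores NA instead
h <- f
suppressWarnings(h[h %in% c("a", "b")] <- "Z")
sum(is.na(h))

# One workaround: add the new level first, then assign
k <- f
levels(k) <- c(levels(k), "Z")
k[k %in% c("a", "b")] <- "Z"
table(k)
```

Note that in the last case "a" and "b" remain as empty levels of k.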

There are two ways. If you don't want to drop the unused levels, i.e. "b" and "c", Joshua's solution is probably best.

If you want to drop the unused levels, then

df$a <- factor(ifelse(df$a %in% c("a","b","c"), "a", as.character(df$a)))

or

levels(df$a) <- ifelse(levels(df$a) %in% c("a","b","c"), "a", levels(df$a))
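
The difference between keeping and dropping the unused levels can be seen side by side in the sketch below. It rebuilds the example data with stringsAsFactors = TRUE so that a is a factor; the names kept and merged, and the column name n, are just illustrative.

```r
df <- data.frame(a = letters[1:26], n = 1:26, stringsAsFactors = TRUE)

# Subassignment changes the values but leaves "b" and "c" as empty levels
kept <- df
kept$a[kept$a %in% c("a", "b", "c")] <- "a"
nlevels(kept$a)
table(kept$a)[c("a", "b", "c")]

# Relabelling the levels merges them, so "b" and "c" disappear
merged <- df
levels(merged$a) <- ifelse(levels(merged$a) %in% c("a", "b", "c"),
                           "a", levels(merged$a))
nlevels(merged$a)

# droplevels() after the subassignment ends up with the same levels
identical(levels(droplevels(kept$a)), levels(merged$a))
```

So if the subassignment approach has already been used, droplevels() is one way to tidy up afterwards.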

This is a simplified version of the chosen answer:

I've found that the easiest way to deal with this is to simply overwrite the factor levels: look at the levels, note the positions of the ones you want to replace, and assign the new label to those positions.

df <- data.frame(a = letters[1:26], 1:26, stringsAsFactors = TRUE)
levels(df$a)

> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" 
 "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

levels(df$a)[c(1,2)] <- "c"
summary(df$a)

> c d e f g h i j k l m n o p q r s t u v w x y z 
  3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
