
R: How are factor labels mapped to the correct values in a data.frame?

EDIT: Included my reading of the documentation that I am still not clear on

I'm new to R and playing around with the mtcars data.frame that comes pre-loaded with R. I am converting the cyl variable to a factor and labeling its levels. My code is:

df <- mtcars
str(df)
df$cyl <- factor(df$cyl, labels = c('Four cylinder', 'Six Cylinder', 'Eight Cylinder'))
str(df)

Which outputs:

> df <- mtcars
> str(df)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
> df$cyl <- factor(df$cyl, labels = c('Four cylinder', 'Six Cylinder', 'Eight Cylinder'))
> str(df)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : Factor w/ 3 levels "Four cylinder",..: 2 2 1 2 3 2 3 1 1 2 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

My question is: how does the factor call assign the labels correctly (i.e. 'Four cylinder', represented as 1 after the transformation, is correctly assigned to the 4s in the original df)? Is it simply applying the labels in ascending order as the default behavior? What if I have a field with, say, 10 unique values that I want to convert to a factor? How can I be sure my labels correspond to the correct original values?

The documentation accessed by ?factor states:

levels: an optional vector of the values (as character strings) that x might have taken. The default is the unique set of values taken by as.character(x), sorted into increasing order of x.

This appears to be stating that the labels are to be applied in ascending order by original variable value, but I just want to make sure I'm understanding correctly.

It works in this example because factor first determines the levels. Since you haven't specified levels in your call to factor, they default to the unique values of mtcars$cyl sorted into increasing order (4, then 6, then 8) and converted to character strings ("4", "6", "8"). The integer codes stored in df$cyl are then found by matching each value against the levels, and the labels are applied to the levels position by position. The labels themselves don't affect the factor ordering: you could, perversely, pair the label "six cylinders" with the level "4".

> as.numeric(factor(c(4, 6, 8, 6, 6, 4)))
[1] 1 2 3 2 2 1
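If you would rather not rely on the default ordering (for instance with 10 unique values), one option is to pass levels and labels together, so that each label is paired with a level by position. The following is only a sketch of that idea; the names cyl_levels, cyl_labels and cyl_f are made up for illustration and are not part of the original code.

```r
df <- mtcars

# Inspect the values that factor() would use as levels by default.
sort(unique(df$cyl))

# Pair each original value with its label explicitly:
# cyl_levels[i] is labelled with cyl_labels[i].
cyl_levels <- c(4, 6, 8)
cyl_labels <- c("Four cylinder", "Six cylinder", "Eight cylinder")
df$cyl_f <- factor(df$cyl, levels = cyl_levels, labels = cyl_labels)

# Cross-tabulate the original values against the new labels.
# Each original value should appear under exactly one label.
table(original = df$cyl, labelled = df$cyl_f)
```

Writing the new factor to a separate column (cyl_f here) keeps the original values around, so the cross-tabulation can confirm the mapping before you overwrite anything.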


 