R data.frame strange behavior when converting characters to numeric

I am dealing with a dataset containing US States FIPS codes coded as characters, where codes from 1 to 9 sometimes have a 0 prefix (01, 02,...). While trying to clean it up I came across the following issue:

library(dplyr)

test <- data.frame(fips = c(1,"01")) %>%
        mutate(fips = as.numeric(fips))

> test
  fips
1    2
2    1

where 1 is converted to 2, and "01" to 1. This annoying behavior disappears with a tibble:

test <- tibble(fips = c(1,"01")) %>%
        mutate(fips = as.numeric(fips))
> test
# A tibble: 2 x 1
   fips
  <dbl>
1     1
2     1

Does anyone know what is going on? Thanks

This is a difference in the defaults for tibbles and data.frames. When you mix together strings and numbers as in c(1, "01"), R converts everything to a string.

c(1, "01")
[1] "1"  "01"

The default behavior for data.frame (in R versions before 4.0.0) is to make strings into factors. If you look at the help page for data.frame you will see the argument:

stringsAsFactors: ... The 'factory-fresh' default is TRUE

So data.frame turns c(1, "01") into a factor with two levels, "01" and "1" (the levels are sorted, so "01" comes first):

T1 = data.frame(fips = c(1,"01")) 
str(T1)
'data.frame':   2 obs. of  1 variable:
 $ fips: Factor w/ 2 levels "01","1": 2 1

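Factors are stored as integer codes that index into the levels, for efficiency. That is why you see 2 1 at the end of the str(T1) output above: "1" is the second level and "01" is the first. So if you convert the factor directly to a number, you get those codes, 2 and 1, rather than the numbers the labels represent.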
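To see the codes and the labels side by side, here is a minimal sketch that builds the factor directly with factor() rather than through data.frame, so it does not depend on the stringsAsFactors default. The variable name f is just for illustration.

```r
# c(1, "01") is coerced to the character vector c("1", "01") before factor() sees it.
f <- factor(c(1, "01"))

levels(f)       # "01" "1"  (levels are sorted, so "01" comes first)
as.integer(f)   # 2 1       (position of each value in levels(f))
as.numeric(f)   # 2 1       (the same codes, returned as doubles)

# Going through the labels recovers the intended numbers.
as.numeric(as.character(f))   # 1 1
as.numeric(levels(f))[f]      # 1 1
```

The last form converts each level once and then indexes by the codes, which can be cheaper on long vectors with few distinct levels.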

You can get the behavior that you want, either by making the data.frame more carefully with

T1 = data.frame(fips = c(1,"01"), stringsAsFactors=FALSE)

or you can convert the factor to a string before converting to a number

mutate(fips = as.numeric(as.character(fips)))

Tibbles do not have this problem because they do not convert the strings to factors.
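Since R 4.0.0 the default for stringsAsFactors is FALSE, so the original example may no longer reproduce on a current installation. The sketch below sets the argument explicitly both ways so the two behaviors can be compared on any version; the names as_factor_df and as_string_df are made up for this example.

```r
# Set stringsAsFactors explicitly so the result does not depend on the R version.
as_factor_df <- data.frame(fips = c(1, "01"), stringsAsFactors = TRUE)
as_string_df <- data.frame(fips = c(1, "01"), stringsAsFactors = FALSE)

class(as_factor_df$fips)   # "factor"
class(as_string_df$fips)   # "character"

# Converting the factor directly returns its internal codes.
as.numeric(as_factor_df$fips)                 # 2 1

# Converting via the labels, or from a character column, returns the values.
as.numeric(as.character(as_factor_df$fips))   # 1 1
as.numeric(as_string_df$fips)                 # 1 1
```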
