
Change the class from factor to numeric of many columns in a data frame

What is the quickest/best way to change a large number of columns from factor to numeric?

I used the following code, but it appears to have re-ordered my data.

> head(stats[,1:2])
  rk                 team
1  1 Washington Capitals*
2  2     San Jose Sharks*
3  3  Chicago Blackhawks*
4  4     Phoenix Coyotes*
5  5   New Jersey Devils*
6  6   Vancouver Canucks*

for(i in c(1,3:ncol(stats))) {
    stats[,i] <- as.numeric(stats[,i])
}

> head(stats[,1:2])
  rk                 team
1  2 Washington Capitals*
2 13     San Jose Sharks*
3 24  Chicago Blackhawks*
4 26     Phoenix Coyotes*
5 27   New Jersey Devils*
6 28   Vancouver Canucks*

What is the best way, short of naming every column as in:

df$colname <- as.numeric(df$colname)

You have to be careful when changing factors to numeric. Here is a line of code that changes a set of columns from factor to numeric. I am assuming here that the columns to be changed are 1, 3, 4 and 5; adjust the indices to suit your data.

cols <- c(1, 3, 4, 5)
df[, cols] <- apply(df[, cols], 2, function(x) as.numeric(as.character(x)))

Further to Ramnath's answer, the behaviour you are experiencing is due to as.numeric(x) returning the internal, numeric representation of the factor x at the R level. If you want to preserve the numbers that are the levels of the factor (rather than their internal representation), you need to convert to character via as.character() first, as per Ramnath's example.
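To make the difference concrete, here is a minimal sketch. The factor f is made up for illustration and is not taken from the question's data; its labels are numbers stored as text, so the levels sort as strings.

```r
# Hypothetical factor; the levels sort as strings: "10", "2", "33"
f <- factor(c("10", "2", "33", "2"))

codes  <- as.numeric(f)                # internal integer codes of the levels
labels <- as.numeric(as.character(f))  # the numbers the labels actually spell

print(codes)
print(labels)
```

The codes only tell you which level each element belongs to, which is why the rk column in the question no longer matched the original ranks.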

Your for loop is just as reasonable as an apply call and might be slightly more readable as to the intention of the code. Just change this line:

stats[,i] <- as.numeric(stats[,i])

to read

stats[,i] <- as.numeric(as.character(stats[,i]))

This is FAQ 7.10 in the R FAQ.
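As a rough sketch of the corrected loop, the snippet below runs it on a toy stand-in for the question's data (stats_demo and its values are invented here). It uses the alternative idiom that FAQ entry gives, as.numeric(levels(f))[as.integer(f)], and adds an is.factor() guard, which is my own assumption about what is wanted when some columns are already numeric.

```r
# Toy stand-in for the question's `stats`; column names and values are made up
stats_demo <- data.frame(
  rk   = factor(c("1", "2", "13")),
  team = c("A", "B", "C"),
  pts  = factor(c("121", "113", "112")),
  stringsAsFactors = FALSE
)

# Convert every column except the second, skipping anything that is not a factor
for (i in c(1, 3:ncol(stats_demo))) {
  f <- stats_demo[[i]]
  if (is.factor(f)) {
    # Look up each element's label by its level code, converting the labels once
    stats_demo[[i]] <- as.numeric(levels(f))[as.integer(f)]
  }
}

str(stats_demo)
```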

HTH

This can be done in one line; there's no need for a loop, be it a for loop or an apply. Use unlist() instead:

# testdata
Df <- data.frame(
  x = as.factor(sample(1:5, 30, replace = TRUE)),
  y = as.factor(sample(1:5, 30, replace = TRUE)),
  z = as.factor(sample(1:5, 30, replace = TRUE)),
  w = as.factor(sample(1:5, 30, replace = TRUE))
)
##

Df[,c("y","w")] <- as.numeric(as.character(unlist(Df[,c("y","w")])))

str(Df)

Edit: for your code, this becomes:

id <- c(1, 3:ncol(stats))
stats[,id] <- as.numeric(as.character(unlist(stats[,id])))

Obviously, if you have a one-column data frame and you don't want R's automatic dimension reduction to convert it to a vector, you'll have to add the drop=FALSE argument.
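A small sketch of what drop=FALSE changes, using a made-up one-column data frame Df1:

```r
# Hypothetical one-column data frame
Df1 <- data.frame(y = factor(c("3", "1", "5")))

dropped <- Df1[, "y"]                # simplified to a bare factor
kept    <- Df1[, "y", drop = FALSE]  # still a one-column data frame

class(dropped)
class(kept)

# The one-liner applied to the single column, keeping the data frame shape
Df1[, "y"] <- as.numeric(as.character(unlist(Df1[, "y", drop = FALSE])))
str(Df1)
```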

I know this question was resolved long ago, but I recently had a similar issue and think I've found a slightly more elegant and functional solution, although it requires the magrittr package.

library(magrittr)
cols = c(1, 3, 4, 5)
df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

The %<>% operator pipes and reassigns, which is very useful for keeping data cleaning and transformation simple. The lapply call is now much easier to read, because you only specify the function you wish to apply.
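For readers without magrittr, the same step can be spelled out in base R. This is only a sketch of what I understand the compound operator to expand to (the left-hand side is passed to the function and the result is assigned back); df_demo and its columns are invented for the example.

```r
# Hypothetical data frame with two factor columns and one character column
df_demo <- data.frame(
  a = factor(c("10", "20")),
  b = c("x", "y"),
  d = factor(c("1.5", "2.5")),
  stringsAsFactors = FALSE
)

cols <- c(1, 3)

# Base R spelling: transform the selected columns and assign them back in place
df_demo[cols] <- lapply(df_demo[cols], function(x) as.numeric(as.character(x)))

str(df_demo)
```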

Here are some dplyr options:

# by column type:
df %>% 
  mutate_if(is.factor, ~as.numeric(as.character(.)))

# by specific columns:
df %>% 
  mutate_at(vars(x, y, z), ~as.numeric(as.character(.))) 

# all columns:
df %>% 
  mutate_all(~as.numeric(as.character(.))) 

I think that ucfagls found why your loop is not working.

In case you still don't want to use a loop, here is a solution with lapply:

factorToNumeric <- function(f) as.numeric(levels(f))[as.integer(f)] 
cols <- c(1, 3:ncol(stats))
stats[cols] <- lapply(stats[cols], factorToNumeric)

Edit. I found a simpler solution. It seems that as.matrix converts to character. So

stats[cols] <- as.numeric(as.matrix(stats[cols]))

should do what you want.
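A quick way to check the as.matrix behaviour on a small case, as a sketch with a made-up all-factor data frame stats_m:

```r
# Hypothetical data frame in which every selected column is a factor
stats_m <- data.frame(
  a = factor(c("7", "30")),
  b = factor(c("2.5", "11"))
)
cols <- c("a", "b")

m <- as.matrix(stats_m[cols])
is.character(m)  # the factor labels, not the level codes

# The numeric vector is refilled column by column into the selected columns
stats_m[cols] <- as.numeric(m)
str(stats_m)
```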

lapply is pretty much designed for this

unfactorize<-c("colA","colB")
df[,unfactorize]<-lapply(unfactorize, function(x) as.numeric(as.character(df[,x])))

I found this function on a couple of other duplicate threads and have found it an elegant and general way to solve this problem. This thread shows up first on most searches on this topic, so I am sharing it here to save folks some time. I take no credit for this, so see the original posts here and here for details.

df <- data.frame(x = 1:10,
                 y = rep(1:2, 5),
                 k = rnorm(10, 5,2),
                 z = rep(c(2010, 2012, 2011, 2010, 1999), 2),
                 j = c(rep(c("a", "b", "c"), 3), "d"))

convert.magic <- function(obj, type){
  FUN1 <- switch(type,
                 character = as.character,
                 numeric = as.numeric,
                 factor = as.factor)
  out <- lapply(obj, FUN1)
  as.data.frame(out)
}

str(df)
str(convert.magic(df, "character"))
str(convert.magic(df, "factor"))
df[, c("x", "y")] <- convert.magic(df[, c("x", "y")], "factor")

I would like to point out that if you have NAs in any column, simply using subscripts will not work. If there are NAs in the factor, you must use the apply script provided by Ramnath.

E.g.

Df <- data.frame(
  x = c(NA,as.factor(sample(1:5,30,r=T))),
  y = c(NA,as.factor(sample(1:5,30,r=T))),
  z = c(NA,as.factor(sample(1:5,30,r=T))),
  w = c(NA,as.factor(sample(1:5,30,r=T)))
)

Df[,c(1:4)] <- as.numeric(as.character(Df[,c(1:4)]))

Returns the following:

Warning message:
NAs introduced by coercion 

    > head(Df)
       x  y  z  w
    1 NA NA NA NA
    2 NA NA NA NA
    3 NA NA NA NA
    4 NA NA NA NA
    5 NA NA NA NA
    6 NA NA NA NA

But:

Df[,c(1:4)]= apply(Df[,c(1:4)], 2, function(x) as.numeric(as.character(x)))

Returns:

> head(Df)
   x  y  z  w
1 NA NA NA NA
2  2  3  4  1
3  1  5  3  4
4  2  3  4  1
5  5  3  5  5
6  4  2  4  4
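A side note on the example above, offered as a sketch rather than a correction of the author's data: as far as I can tell, the all-NA result comes from calling as.character() directly on a data frame, which turns each whole column into a single string, rather than from the missing values themselves. With unlist() in place, as in the earlier one-liner, NAs inside the factors seem to pass through unchanged. Df_na is a made-up data frame.

```r
# Hypothetical factor columns that contain missing values
Df_na <- data.frame(
  x = factor(c(NA, "2", "1", "5")),
  y = factor(c(NA, "3", "5", "3"))
)

# unlist() flattens the columns into one factor before the conversion
Df_na[, 1:2] <- as.numeric(as.character(unlist(Df_na[, 1:2])))

str(Df_na)
```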

You can use the unfactor() function from the "varhandle" package on CRAN:

library("varhandle")

my_iris <- data.frame(Sepal.Length = factor(iris$Sepal.Length),
                      sample_id = factor(1:nrow(iris)))

my_iris <- unfactor(my_iris)

I like this code because it's pretty handy:

  data[] <- lapply(data, function(x) type.convert(as.character(x), as.is = TRUE)) #change all vars to their best fitting data type

It is not exactly what was asked for (convert to numeric), but in many cases it is even more appropriate.
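To illustrate what "best fitting data type" means here, a small sketch with an invented data frame data_demo whose factor columns hold integers, decimals and text:

```r
# Hypothetical data frame: every column starts out as a factor
data_demo <- data.frame(
  n = factor(c("1", "2", "3")),
  d = factor(c("1.5", "2", "3.25")),
  s = factor(c("a", "b", "c"))
)

# type.convert() picks a type per column; as.is = TRUE keeps text as character
data_demo[] <- lapply(data_demo, function(x) type.convert(as.character(x), as.is = TRUE))

sapply(data_demo, class)
```

Note that whole numbers come back as integer rather than double, so wrap the result in as.numeric() if a double is specifically required.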

I tried a bunch of these on a similar problem and kept getting NAs. Base R has some really irritating coercion behaviors, which are generally fixed in Tidyverse packages. I used to avoid them because I didn't want to create dependencies, but they make life so much easier that now I don't even bother trying to figure out the Base R solution most of the time.

Here's the Tidyverse solution, which is extremely simple and elegant:

library(purrr)

mydf <- data.frame(
  x1 = factor(c(3, 5, 4, 2, 1)),
  x2 = factor(c("A", "C", "B", "D", "E")),
  x3 = c(10, 8, 6, 4, 2))

map_df(mydf, as.numeric)
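One caveat worth checking before relying on this: as.numeric() applied directly to a factor returns the level codes, as discussed earlier in the thread. In mydf the codes for x1 happen to coincide with the labels, and x2 comes back as 1, 3, 2, 4, 5. The base R sketch below uses a made-up data frame mydf2 where codes and labels differ. The helper name to_num is hypothetical, and it should be possible to pass the same helper to map_df().

```r
# Hypothetical data: the factor labels do not match their level codes
mydf2 <- data.frame(
  x1 = factor(c(30, 50, 40)),
  x3 = c(10, 8, 6)
)

# Hypothetical helper: go through the labels for factors, plain coercion otherwise
to_num <- function(x) {
  if (is.factor(x)) as.numeric(as.character(x)) else as.numeric(x)
}

codes_x1 <- as.numeric(mydf2$x1)  # level codes, not 30/50/40
mydf2[] <- lapply(mydf2, to_num)  # labels recovered as numbers

print(codes_x1)
str(mydf2)
```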

df$colname <- as.numeric(df$colname)

I tried this way of changing one column's type, and I think it is better than many other versions if you are not going to change all column types.

df$colname <- as.character(df$colname)

for the reverse direction.

I had problems converting all columns to numeric with an apply() call:

apply(data, 2, as.numeric)

The problem turned out to be that some of the strings had a comma in them (e.g. "1,024.63" instead of "1024.63"), and R does not accept numbers formatted this way. So I removed the commas and then ran as.numeric():

data = as.data.frame(apply(data, 2, function(x) {
  y = str_replace_all(x, ",", "") #remove commas
  return(as.numeric(y)) #then convert
}))

Note that this requires the stringr package to be loaded.
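If adding a dependency is not desirable, the same cleanup can be sketched in base R with gsub(). The data frame data_commas and its values are made up for the example.

```r
# Hypothetical columns holding numbers formatted with thousands separators
data_commas <- data.frame(
  price = c("1,024.63", "12", "3,000,000"),
  qty   = c("7", "1,200", "35"),
  stringsAsFactors = FALSE
)

# Strip literal commas, then convert; fixed = TRUE avoids regex interpretation
data_commas[] <- lapply(data_commas, function(x) {
  as.numeric(gsub(",", "", x, fixed = TRUE))
})

str(data_commas)
```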

That's what worked for me. The apply() function tries to coerce df to a matrix, and it returns NAs.

numeric.df <- as.data.frame(sapply(df, as.numeric))

Based on @SDahm's answer, this was an "optimal" solution for my tibble:

data %<>% lapply(type.convert) %>% as.data.table()

This requires magrittr for the %<>% and %>% operators; as.data.table() comes from the data.table package.
