
model.matrix raises memory allocation error

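I am using model.matrix to create a number of columns from an existing data frame. The goal is to create one column for each distinct value of a single feature column (my_one_feature). That is, if my_one_feature is a categorical variable with values {cat_1, cat_2, cat_3}, I want to generate 3 additional columns named cat_1, cat_2 and cat_3, each taking the value 0 or 1 depending on whether that value is present in the corresponding row.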

mm <- model.matrix(~factor(my_one_feature)-1,data=my_data_frame)

Then I can do:

cbind(my_data_frame,mm)

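I believe this does, functionally, exactly what I described. However, for large data and/or a feature with many distinct values, it produces a memory allocation error: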

cannot allocate vector of size 50 Gb

I know the resulting matrix will be sparse. How can I avoid this memory allocation problem?
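As a rough illustration of why the dense result is so large: model.matrix returns an ordinary numeric matrix, so every cell costs 8 bytes regardless of how many zeros it holds. The sketch below compares that with an approximate footprint for a column-compressed sparse matrix. The row and level counts are hypothetical and are not taken from the question.

```r
# Hypothetical sizes, chosen only for illustration (not from the question)
n_rows   <- 1e6    # rows in the data frame
n_levels <- 6000   # distinct values of my_one_feature

# A dense numeric matrix stores every cell as an 8-byte double
dense_gb <- n_rows * n_levels * 8 / 1024^3

# A column-compressed sparse matrix stores roughly one double and one
# integer row index per non-zero cell (one per row for an indicator
# matrix), plus one integer pointer per column
sparse_gb <- (n_rows * (8 + 4) + (n_levels + 1) * 4) / 1024^3

round(c(dense = dense_gb, sparse = sparse_gb), 3)
```

With these assumed sizes the dense matrix needs tens of gigabytes, while the sparse estimate stays in the range of megabytes.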

Here is an example with only 7 rows and 4 original features:

f1<-c('f1_1','f1_2','f1_1','f1_3','f1_3','f1_1','f1_4')
f2<-c(1,2,3,4,2,4,2)
f3<-c(1,2,3,4,5,6,7)
f4<-c(0,0,1,1,1,0,1)

my_data_frame<-data.frame(f1,f2,f3,f4)

It looks like this:

my_data_frame
    f1 f2 f3 f4
1 f1_1  1  1  0
2 f1_2  2  2  0
3 f1_1  3  3  1
4 f1_3  4  4  1
5 f1_3  2  5  1
6 f1_1  4  6  0
7 f1_4  2  7  1

library(Matrix)  # provides sparse.model.matrix
mm<-sparse.model.matrix(~factor(f1)-1,data=my_data_frame)

which looks like this:

7 x 4 sparse Matrix of class "dgCMatrix"
  factor(f1)f1_1 factor(f1)f1_2 factor(f1)f1_3 factor(f1)f1_4
1              1              .              .              .
2              .              1              .              .
3              1              .              .              .
4              .              .              1              .
5              .              .              1              .
6              1              .              .              .
7              .              .              .              1
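If the columns should carry the category values themselves as names (f1_1 rather than factor(f1)f1_1), one option is to build the indicator matrix directly with sparseMatrix() from the Matrix package. This is a minimal sketch rather than the approach used in this thread; the names f and ind are hypothetical, and it assumes f1 contains no missing values.

```r
library(Matrix)

f1 <- c('f1_1','f1_2','f1_1','f1_3','f1_3','f1_1','f1_4')
f  <- factor(f1)

# One non-zero entry per row: row k gets a 1 in the column of its level.
# Assumes no NA values in f1 (as.integer(f) would then contain NA).
ind <- sparseMatrix(
  i = seq_along(f),            # row index of each non-zero entry
  j = as.integer(f),           # column index = integer code of the level
  x = 1,                       # value stored in each non-zero cell
  dims = c(length(f), nlevels(f)),
  dimnames = list(NULL, levels(f))
)

colnames(ind)
```

Renaming the columns of an existing matrix, for example with sub() on colnames(mm), would be an alternative that keeps sparse.model.matrix in place.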

How can I combine my_data_frame with mm so that the resulting object has all of the columns (f1, f2, f3, f4, factor(f1)f1_1, factor(f1)f1_2, factor(f1)f1_3, factor(f1)f1_4) and, of course, 7 rows?

Your answer gives the following result in my RStudio:

> my_data_frame <- data.frame(
+     f1=c('f1_1','f1_2','f1_1','f1_3','f1_3','f1_1','f1_4'),
+     f2=c(1,2,3,4,2,4,2),
+     f3=c(1,2,3,4,5,6,7),
+     f4=c(0,0,1,1,1,0,1))
> library("Matrix")
> mm <- sparse.model.matrix(~factor(f1)-1,
+                           data=my_data_frame)
> new_data_frame <- cbind(Matrix(as.matrix(my_data_frame[,-1])),
+                         mm)
> dim(new_data_frame)
[1] 1 2
> str(new_data_frame)
List of 2
 $ :Formal class 'dgeMatrix' [package "Matrix"] with 4 slots
  .. ..@ x       : num [1:21] 1 2 3 4 2 4 2 1 2 3 ...
  .. ..@ Dim     : int [1:2] 7 3
  .. ..@ Dimnames:List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:3] "f2" "f3" "f4"
  .. ..@ factors : list()
 $ :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. ..@ i       : int [1:7] 0 2 5 1 3 4 6
  .. ..@ p       : int [1:5] 0 3 4 6 7
  .. ..@ Dim     : int [1:2] 7 4
  .. ..@ Dimnames:List of 2
  .. .. ..$ : chr [1:7] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:4] "factor(f1)f1_1" "factor(f1)f1_2" "factor(f1)f1_3" "factor(f1)f1_4"
  .. ..@ x       : num [1:7] 1 1 1 1 1 1 1
  .. ..@ factors : list()
 - attr(*, "dim")= int [1:2] 1 2
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "" "mm"
> 


> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Lithuanian_Lithuania.1257  LC_CTYPE=Lithuanian_Lithuania.1257    LC_MONETARY=Lithuanian_Lithuania.1257 LC_NUMERIC=C                         
[5] LC_TIME=Lithuanian_Lithuania.1257    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Matrix_1.2-2

loaded via a namespace (and not attached):
[1] grid_3.1.3      lattice_0.20-30 tools_3.1.3    
> 

Set up the data:

 my_data_frame <- data.frame(
 f1=c('f1_1','f1_2','f1_1','f1_3','f1_3','f1_1','f1_4'),
 f2=c(1,2,3,4,2,4,2),
 f3=c(1,2,3,4,5,6,7),
 f4=c(0,0,1,1,1,0,1))

Now use sparse.model.matrix for the categorical feature:

library("Matrix")
mm <- sparse.model.matrix(~factor(f1)-1,
           data=my_data_frame)

Combine it back with the numeric predictors (coercing data.frame -> matrix -> Matrix):

new_data_frame <- cbind(Matrix(as.matrix(my_data_frame[,-1])),
                        mm)

Results:

dim(new_data_frame)
## [1] 7 7 
str(new_data_frame)
## Formal class 'dgeMatrix' [package "Matrix"] with 4 slots
##   ..@ x       : num [1:49] 1 2 3 4 2 4 2 1 2 3 ...
##   ..@ Dim     : int [1:2] 7 7
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:7] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:7] "f2" "f3" "f4" "factor(f1)f1_1" ...
##   ..@ factors : list()

object.size(new_data_frame) ## 1596 bytes
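Note that the str() output above reports a dgeMatrix, which is a dense class. A possible variation, sketched here under the assumption that a sparse result is wanted, is to ask Matrix() for a sparse representation of the numeric block before combining. The names num_sparse and combined are hypothetical.

```r
library(Matrix)

my_data_frame <- data.frame(
  f1 = c('f1_1','f1_2','f1_1','f1_3','f1_3','f1_1','f1_4'),
  f2 = c(1,2,3,4,2,4,2),
  f3 = c(1,2,3,4,5,6,7),
  f4 = c(0,0,1,1,1,0,1))

mm <- sparse.model.matrix(~ factor(f1) - 1, data = my_data_frame)

# sparse = TRUE requests sparse storage for the numeric columns as well
num_sparse <- Matrix(as.matrix(my_data_frame[, -1]), sparse = TRUE)

# Combining two sparse matrices should keep the result sparse
combined <- cbind(num_sparse, mm)
class(combined)
```

Whether this saves memory depends on the data: numeric columns with few zeros (such as f2 and f3 here) cost more in sparse form, so the gain comes from the indicator columns when the factor has many levels.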

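The result does not include the original f1 column, because a matrix cannot hold heterogeneous types. That column could not be used in its original form for numerical modeling and prediction in any case.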

Session info (the OP is using R 3.1.3 / Windows 8 x64 / Lithuanian locale / Matrix_1.2-2 / tools_3.1.3):

R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Matrix_1.2-2

loaded via a namespace (and not attached):
[1] grid_3.2.1      lattice_0.20-33
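The two session listings differ in R version (3.1.3 versus 3.2.1), which is a likely explanation for the 1 x 2 list the OP obtained: to my understanding, base cbind() only started dispatching to S4 methods for Matrix objects in R 3.2.0. A sketch that should not depend on that change is to call cbind2() from the methods package, for which Matrix defines methods. The names num_part and combined are hypothetical.

```r
library(Matrix)

my_data_frame <- data.frame(
  f1 = c('f1_1','f1_2','f1_1','f1_3','f1_3','f1_1','f1_4'),
  f2 = c(1,2,3,4,2,4,2),
  f3 = c(1,2,3,4,5,6,7),
  f4 = c(0,0,1,1,1,0,1))

mm <- sparse.model.matrix(~ factor(f1) - 1, data = my_data_frame)
num_part <- Matrix(as.matrix(my_data_frame[, -1]))

# cbind2() is an S4 generic taking exactly two arguments, so it
# dispatches on the Matrix classes directly
combined <- methods::cbind2(num_part, mm)
dim(combined)
```

cbind2() combines exactly two objects at a time, so joining more than two pieces means nesting the calls.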
