data.frame with a column containing a matrix in R

I'm trying to put some matrices in a data frame in R, something like:

m <- matrix(c(1,2,3,4), nrow=2, ncol=2)
df <- data.frame(id=1, mat=m)

But when I do that, I get a data frame with 2 rows and 3 columns instead of a data frame with 1 row and 2 columns.

According to the documentation, I have to protect my matrix using I().

df <- data.frame(id=1, mat=I(m))

str(df)
'data.frame':   2 obs. of  2 variables:
 $ id : num  1 1
 $ mat: AsIs [1:2, 1:2] 1 2 3 4

As I understand it, the data frame contains one row for each row of the matrix, and the mat field is a list of matrix column values.

So, how can I obtain a data frame containing matrices?

Thanks!

I find data.frames containing matrices mind-bendingly weird, but the only way I know to achieve this is hidden in stats:::simulate.lm

Try this, step through it in the debugger, and see what's happening:

d <- data.frame(y=1:5,n=5)
g0 <- glm(cbind(y,n-y)~1,data=d,family=binomial)
debug(stats:::simulate.lm)
s <- simulate(g0,n=5)

This is the weird, back-door solution. Create a list, change its class to data.frame, and then (this is required) set the names and row.names manually. If you skip those final steps, the data will still be in the object, but it will print as though it had zero rows.

m1 <- matrix(1:10,ncol=2)
m2 <- matrix(5:14,ncol=2)
dd <- list(m1,m2)
class(dd) <- "data.frame"
names(dd) <- LETTERS[1:2]
row.names(dd) <- 1:5
dd

A much easier way to do this is to define the data frame with a placeholder for the matrix:

m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2) 
df <- data.frame(id = 1, mat = rep(0, nrow(m)))

Then assign the matrix. There is no need to play with the class of a list or to use an *apply() function.

df$mat <- m
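As a quick check of the placeholder approach, the sketch below repeats the same steps and inspects the result. It assumes nothing beyond base R; the comments describe what the checks are expected to show.

```r
# Same steps as above: build the frame with a placeholder, then assign the matrix.
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
df <- data.frame(id = 1, mat = rep(0, nrow(m)))
df$mat <- m

dim(df)            # 2 rows, 2 columns: the matrix counts as one column
is.matrix(df$mat)  # the column is still a matrix
df$mat[2, ]        # second row of the stored matrix: 2 4
df$mat[, 2]        # second column of the stored matrix: 3 4
```

The placeholder only has to have the right number of rows; its values are overwritten by the assignment.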

I came across the same problem while trying to understand the gasoline data in the pls package, and used $ for the job. First, let's create a matrix called spectra_mat, then a vector called response_var1.

spectra_mat = matrix(1:45, 9, 5)
response_var1 = 1:9

Now we put the vector response_var1 in a new data frame, which we'll call df, and add the matrix with $.

df = data.frame(response_var1)
df$spectra = spectra_mat

To check:

str(df)

'data.frame':   9 obs. of  2 variables:
 $ response_var1: int  1 2 3 4 5 6 7 8 9
 $ spectra      : int [1:9, 1:5] 1 2 3 4 5 6 7 8 9 10 ...
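One practical consequence is that row subsetting of the data frame carries the matrix column along with it. The sketch below rebuilds the same objects and filters on response_var1; the name sub is only an illustrative choice.

```r
# Rebuild the example data frame with a 9 x 5 matrix column.
spectra_mat <- matrix(1:45, 9, 5)
df <- data.frame(response_var1 = 1:9)
df$spectra <- spectra_mat

# Row subsetting keeps the matching rows of the matrix column.
sub <- df[df$response_var1 > 5, ]
dim(sub$spectra)   # 4 5
sub$spectra[1, ]   # row 6 of the original matrix: 6 15 24 33 42

# The matrix can be used directly in matrix functions.
rowMeans(df$spectra)
```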

Data frames containing matrix columns do have their uses in specialized scenarios, namely when you have a whole vector of some variable for every observation in your data set. I have come across two cases where this is common:

  1. Bayesian analysis: you create a posterior prediction for each observation, so for every "row" in your newdata, you have an entire vector (the length of that vector is the number of MCMC iterations).
  2. Functional data analysis: each "observation" is itself a function, and you store the observed realization of that function as a vector.

If you're working with data frames, there are a few obvious ways to handle this data, all of them inefficient. I'll use the Bayesian case as an example:

  1. "Super-wide" format: you have one column for each element of the vectors, in addition to the other columns of the data frame. This makes an extremely wide data frame that is often hard to work with. It also makes it difficult to refer to only those columns that correspond to the posterior.
  2. "Super-long" (tidy) format: very memory intensive, because all of the other columns of your data frame have to be repeated unnecessarily for every iteration of the posterior.
  3. List-columns: you can create a list where each element is the vector corresponding to the posterior for that row of the data frame. The problem here is that most of the manipulation you want to do will require you to unlist the posterior back to a matrix, and the listing/unlisting is unnecessary computation.

Data frames with matrix columns are a very useful solution to this situation. The posterior stays in a matrix that has the same number of rows as the data frame, but that matrix is recognized as a single "column" in the data frame, and referring to that column using df$mat returns the matrix. You can even use some dplyr functions, such as filtering, to return the corresponding rows of the matrix, but this is a bit experimental.
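A minimal sketch of the Bayesian case follows. The posterior draws are simulated with rnorm() purely as a stand-in for real MCMC output, and the names newdata, post, post_mean, post_lwr, and keep are illustrative assumptions, not part of any package.

```r
set.seed(1)
n_obs  <- 4      # rows in newdata
n_iter <- 1000   # stand-in for the number of MCMC iterations

newdata <- data.frame(x = seq_len(n_obs))

# One row per observation, one column per iteration (simulated, not real draws).
newdata$post <- matrix(
  rnorm(n_obs * n_iter, mean = rep(newdata$x, times = n_iter)),
  nrow = n_obs
)

# Row-wise summaries work on the matrix directly, with no unlisting.
newdata$post_mean <- rowMeans(newdata$post)
newdata$post_lwr  <- apply(newdata$post, 1, quantile, probs = 0.025)

# Base row subsetting keeps the matching rows of the posterior matrix.
keep <- newdata[newdata$x > 2, ]
dim(keep$post)   # 2 1000
```

Base subsetting is used here instead of dplyr so the sketch has no package dependencies.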

The easiest way to create the matrix column takes two steps. First create the data frame without the matrix column, then add the matrix column with a simple assignment. I haven't found a one-step solution that doesn't involve I(), which changes the column type.

m <- matrix(c(1,2,3,4), nrow=2, ncol=2)
df <- data.frame(id = rep(1, nrow(m)))
df$mat <- m
names(df)
# [1] "id"  "mat"
str(df)
# 'data.frame': 2 obs. of  2 variables:
#  $ id : num  1 1
#  $ mat: num [1:2, 1:2] 1 2 3 4
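To see the difference in column type, the sketch below builds the frame both ways and compares the results. The names df_asis and df_plain are illustrative only.

```r
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

# One step, using I(): the column gains the "AsIs" class.
df_asis <- data.frame(id = rep(1, nrow(m)), mat = I(m))

# Two steps, using assignment: the column stays a plain matrix.
df_plain <- data.frame(id = rep(1, nrow(m)))
df_plain$mat <- m

inherits(df_asis$mat, "AsIs")    # TRUE
inherits(df_plain$mat, "AsIs")   # FALSE
is.matrix(df_asis$mat)           # TRUE: the dimensions are kept either way

# If the extra class gets in the way, unclass() drops it.
plain_again <- unclass(df_asis$mat)
inherits(plain_again, "AsIs")    # FALSE
```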

The result you got (2 rows x 3 columns) is what is to be expected from R, as it amounts to cbind-ing a vector (id, with recycling) and a matrix (m).
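A short sketch of that behavior, using the matrix from the question; df_wide is an illustrative name.

```r
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

# data.frame() splits the matrix into one column per matrix column
# and recycles the scalar id to fill both rows.
df_wide <- data.frame(id = 1, mat = m)

dim(df_wide)     # 2 3
names(df_wide)   # "id" "mat.1" "mat.2"
df_wide$id       # 1 1
```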

IMO, it would be better to use a list or an array (when dimensions agree; no mix of numeric and factor values is allowed) if you really want to bind different data structures. Otherwise, if your matrix and an existing data.frame have the same number of rows, just cbind them and that will do the job. For example:

x1 <- replicate(2, rnorm(10))
x2 <- replicate(2, rnorm(10))
x12l <- list(x1=x1, x2=x2)
x12a <- array(c(x1, x2), dim=c(10,2,2))  # c() keeps x1 in slice [,,1] and x2 in slice [,,2]

and the result reads

> str(x12l)
List of 2
 $ x1: num [1:10, 1:2] -0.326 0.552 -0.675 0.214 0.311 ...
 $ x2: num [1:10, 1:2] -0.164 0.709 -0.268 -1.464 0.744 ...
> str(x12a)
 num [1:10, 1:2, 1:2] -0.326 0.552 -0.675 0.214 0.311 ...

Lists are easier to use if you plan to work with matrices of varying dimensions, and provided they are organized in the same way (by rows) as an external data.frame, you can subset them just as easily. Here is an example:

df1 <- data.frame(grp=gl(2, 5, labels=LETTERS[1:2]), 
                  age=sample(seq(25,35), 10, replace=TRUE))
with(df1, tapply(x12l$x1[,1], list(grp, age), mean))

You can also use the lapply (for lists) and apply (for arrays) functions.
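As a rough illustration, the sketch below computes column means per matrix with both functions. It assumes the array is built with c(x1, x2) so that slice k holds the k-th matrix; the names col_means_list and col_means_array are illustrative.

```r
set.seed(42)
x1 <- replicate(2, rnorm(10))
x2 <- replicate(2, rnorm(10))

x12l <- list(x1 = x1, x2 = x2)
# Assumption: stack the matrices so that x12a[, , 1] is x1 and x12a[, , 2] is x2.
x12a <- array(c(x1, x2), dim = c(10, 2, 2))

# List: one result per element.
col_means_list <- lapply(x12l, colMeans)

# Array: apply over the third dimension; each column of the result is one slice.
col_means_array <- apply(x12a, 3, colMeans)

col_means_list$x1
col_means_array[, 1]   # same values as the line above
```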
