
Applying a function to every two columns in R

Is there a way to apply a function to every two columns of a data frame? Suppose I have the data frame

dat <- data.frame(A=rnorm(100), B=rnorm(100),C=rnorm(100), D=rnorm(100))

A           B            C          D
0.1511642 -0.44930197  1.821832535  2.0145395
-1.1639599  0.42685832 -0.763015835 -0.7785278
0.8430158  0.26827386 -0.004560031  0.8823789
0.7103298  0.78512673 -0.968510541  0.5172418
0.8508458  0.05809655  0.391845531  0.7452540
0.2217195 -0.06988857  0.714890499 -1.1536502

and I want the sum of each column. I can use

apply(dat,2,sum)

But what if I want to apply a function over every two columns? For example:

coefficients(lm(dat$A~dat$B))
coefficients(lm(dat$C~dat$D))

I have 400 columns and don't want to write this out 200 times, once for each pair of columns. I thought a for loop using columns j and j+1 could work, but I want the relationship between columns A and B, then columns C and D, then columns E and F, and so on. Not columns A and B, then B and C, then C and D. Is there a way to do this with apply() or another function in the apply family?
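For reference, the loop idea mentioned above can be made to skip the overlapping pairs by stepping the start index by two. This is only a minimal sketch on a four-column example; the names starts and coefs are illustrative and not part of the original question.

```r
set.seed(42)
dat <- data.frame(A = rnorm(100), B = rnorm(100), C = rnorm(100), D = rnorm(100))

# Start positions of the non-overlapping pairs: 1, 3, 5, ...
starts <- seq(1, ncol(dat) - 1, by = 2)

coefs <- vector("list", length(starts))
for (i in seq_along(starts)) {
  j <- starts[i]
  # Regress column j on column j + 1
  coefs[[i]] <- coef(lm(dat[[j]] ~ dat[[j + 1]]))
}

# Label each result with the pair of columns it came from
names(coefs) <- paste(names(dat)[starts], "~", names(dat)[starts + 1])
coefs
```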

Create a grouping vector g, split the columns on it, and lapply lm over the resulting pieces.

Note that if d = data.frame(y, x) for response y and predictor x, then lm(d) is the regression lm(y ~ x, d).

n <- ncol(dat)
g <- rep(1:n, each = 2, length = n) # 1 1 2 2 
L <- lapply(split.default(dat, g), lm)

sapply(L, coef) # coefficients
sapply(L, function(x) summary(x)$r.squared) # R^2
# etc.
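The two ingredients of this answer can be checked in isolation. The sketch below uses made-up data (x, y, d and n = 6 are illustrative) to confirm that lm(d) and lm(y ~ x, d) give the same coefficients, and to show what the grouping vector looks like for six columns.

```r
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
d <- data.frame(y, x)  # response first, predictor second

fit_df <- lm(d)               # data frame form
fit_formula <- lm(y ~ x, d)   # explicit formula form

# Both forms estimate the same regression
same_coef <- isTRUE(all.equal(coef(fit_df), coef(fit_formula)))
same_coef

# Grouping vector for six columns: each consecutive pair shares a group
n <- 6
g <- rep(1:n, each = 2, length = n)
g
```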

It could also be done over the names:

L2 <- lapply(split.default(names(dat), g), function(nms) lm(dat[nms]))
sapply(L2, coef)

or, if you want a nicer Call: line in the output:

reg <- function(nms, dat) do.call("lm", list(reformulate(nms[2], nms[1]), quote(dat)))
L2 <- lapply(split.default(names(dat), g), reg, dat = dat)
sapply(L2, coef)

Note that variables in lm formulas cannot start with a digit, so you may need to rename your columns if this requirement is violated. If you use the lm(dat) form then this is not a requirement, but if you use a formula it is. See Note for examples.
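As a small illustration of the naming issue, one possible way to get formula-safe names is base R's make.names(), which prefixes an "X" to names that start with a digit. The sketch below uses made-up data; quoting the original names with backticks is an alternative not mentioned in the answer and is shown here only as an assumption-labelled aside.

```r
set.seed(1)
dat2 <- setNames(
  data.frame(rnorm(100), rnorm(100), rnorm(100), rnorm(100)),
  c("1234.score1", "1234.score2", "5678.score1", "5678.score2")
)

# Option 1: rename the columns so they are syntactically valid
dat3 <- setNames(dat2, make.names(names(dat2)))
names(dat3)
fit_renamed <- lm(X1234.score1 ~ X1234.score2, data = dat3)

# Option 2 (aside, not from the answer): keep the names and quote them with backticks
fit_quoted <- lm(`1234.score1` ~ `1234.score2`, data = dat2)

# Both fits describe the same regression
all.equal(unname(coef(fit_renamed)), unname(coef(fit_quoted)))
```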

Note

Regarding the comment under the question about the form of the names: if the names were as shown below, we could alternatively form g using this code:

# modify test example
s <- c("1234.score1", "1234.score2", "5678.score1", "5678.score2")
dat2 <- setNames(dat, s)

g <- cumsum(sub(".*\\D", "", names(dat2)) == 1)  # 1 1 2 2
L <- lapply(split.default(dat2, g), lm)
sapply(L, coef)
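To see why the cumsum expression yields one group per pair, it can help to look at the intermediate values. This is just a step-by-step restatement of the line above; the names last_digits and starts_pair are illustrative.

```r
s <- c("1234.score1", "1234.score2", "5678.score1", "5678.score2")

# Strip everything up to and including the last non-digit character,
# leaving only the trailing score number
last_digits <- sub(".*\\D", "", s)
last_digits

# TRUE wherever a new pair begins (the column ending in 1)
starts_pair <- last_digits == 1
starts_pair

# The running count of pair starts is the group id
g <- cumsum(starts_pair)
g
```

Note that this relies on the columns of each pair being adjacent and ordered score1, score2.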

or we could use this (however, this will cause the output to be sorted by g):

# modify column names
dat3 <- dat2
names(dat3) <- paste0("x", names(dat3))

g <- sub("\\..*", "", names(dat3)) # x1234 x1234 x5678 x5678
reg <- function(nms, dat) do.call("lm", list(reformulate(nms[2], nms[1]), quote(dat)))
L2 <- lapply(split.default(names(dat3), g), reg, dat = dat3)
sapply(L2, coef)

You could use mapply / Map to apply a function to every pair of columns by passing the odd-numbered and the even-numbered columns of your data frame as two parallel arguments. Hope this helps!

Using lm

lm_list <- Map(function(y, x) summary(lm(y ~ x))$coefficients, dat[c(TRUE, FALSE)], dat[c(FALSE, TRUE)])
names(lm_list) <- paste0(names(dat[c(TRUE, FALSE)]), " ~ ", names(dat[c(FALSE, TRUE)]))
lm_list

$`A ~ B`
              Estimate Std. Error   t value  Pr(>|t|)
(Intercept) 0.03566648  0.1051079 0.3393320 0.7350857
x           0.03602569  0.1162846 0.3098062 0.7573662

$`C ~ D`
                Estimate Std. Error     t value  Pr(>|t|)
(Intercept) -0.008610382  0.1021835 -0.08426389 0.9330185
x           -0.053369101  0.1171255 -0.45565742 0.6496444

Data:

set.seed(42)
dat <- data.frame(A=rnorm(100), B=rnorm(100),C=rnorm(100), D=rnorm(100))
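The answer mentions mapply as well as Map. As a sketch of that variant, mapply simplifies the result to a matrix when every call returns a vector of the same length, which is convenient if only the coefficients are needed. The name coef_mat is illustrative.

```r
set.seed(42)
dat <- data.frame(A = rnorm(100), B = rnorm(100), C = rnorm(100), D = rnorm(100))

odd  <- c(TRUE, FALSE)   # recycled: picks columns 1, 3, 5, ...
even <- c(FALSE, TRUE)   # recycled: picks columns 2, 4, 6, ...

# One column of coefficients per pair; rows are intercept and slope
coef_mat <- mapply(function(y, x) coef(lm(y ~ x)), dat[odd], dat[even])
colnames(coef_mat) <- paste(names(dat)[odd], "~", names(dat)[even])
coef_mat
```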

You can take advantage of the naming convention to first stack the data and then operate on the groups of common IDs. This may make things easier for future analysis.

I modified the column names per the comment.

dat <- data.frame(ID1.score1=rnorm(100), ID1.score2=rnorm(100),ID2.score1=rnorm(100), ID2.score2=rnorm(100))

library(dplyr)
library(stringr)
library(purrr)

Split the column names at ".". The first half gives the IDs; the second half specifies score1 or score2 (i.e., X or Y).

cols <- str_split(names(dat), "\\.", simplify = TRUE)
ids <- unique(cols[,1])
scores <- unique(cols[,2])

Using purrr, iterate through the IDs and select the column pair that starts with each ID. Add another column to this new data.frame to store the ID. Then stack all of these by rows. Now we have a "tidy" formatted dataset.

stacked_dat <- ids %>%
  map_dfr(~ {
    # Match on "ID." (including the dot) so that e.g. ID1 does not also pick up ID10
    select(dat, starts_with(paste0(.x, "."))) %>%
      set_names(scores) %>%
      mutate(id = .x)})

Now just group on the ID column and fit the model for each ID.

fits <- stacked_dat %>%
  group_by(id) %>%
  do(model = lm(score1 ~ score2, data = .))

The fitted models are stored in a list column, which you can access like this. The broom package, together with purrr, might help stack and clean up the results.

fits$model
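As a rough sketch of what the broom suggestion could look like, the block below fits one model per ID and stacks the output of broom::tidy() into a single table. It assumes broom is installed and, to stay self-contained, refits the models with base R rather than reusing fits; the names models and tidy_fits are illustrative.

```r
library(broom)  # assumed to be installed

set.seed(1)
dat <- data.frame(ID1.score1 = rnorm(100), ID1.score2 = rnorm(100),
                  ID2.score1 = rnorm(100), ID2.score2 = rnorm(100))
ids <- unique(sub("\\..*", "", names(dat)))

# Fit score1 ~ score2 separately for each ID
models <- lapply(ids, function(id) {
  d <- setNames(dat[paste0(id, c(".score1", ".score2"))], c("score1", "score2"))
  lm(score1 ~ score2, data = d)
})
names(models) <- ids

# tidy() returns one row per coefficient; add the ID and stack the rows
tidy_fits <- do.call(rbind, Map(function(m, id) cbind(id = id, tidy(m)), models, ids))
tidy_fits
```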

This is a completely different approach. You can create a list of formulas, one for each pairing, based on the names. Then just iterate over the formulas, fitting each one on the same data set.

dat <- data.frame(ID1.score1=rnorm(100), ID1.score2=rnorm(100),ID2.score1=rnorm(100), ID2.score2=rnorm(100))

ids <- unique(sub("\\..*", "", names(dat)))
f <- lapply(paste0(ids, ".score2 ~ ", ids, ".score1"), as.formula)

models <- lapply(f, function(f) lm(f, dat))

Then you can extract whatever you need from the list of models.

model_coef <- sapply(models, coef)
colnames(model_coef) <- ids

model_coef

                    ID1         ID2
(Intercept) -0.07592376 -0.02472962
ID1.score1  -0.02284805  0.09144416
