R - Applying same function on multiple columns

This is my first time asking a question here and I'm a beginner at R.

I have a huge dataset, and I want to get an overview of the values in multiple columns, based on category:

sampleID|category|element_1|element_2|element_3|element_4|
----------------------------------------------------------
    1   |    A   |  12.53  |   46.17 |   94.09 |  25.23  |
    2   |    B   |  19.53  |   16.17 |   14.09 |  28.23  |
    3   |    C   |  21.53  |   56.17 |   24.09 |  26.23  |
    4   |    D   |  18.53  |   96.17 |   34.09 |  21.23  |
    5   |    B   |  17.53  |   76.17 |   44.09 |  24.23  |
    6   |    A   |  32.53  |   36.17 |   54.09 |  25.23  |

What I've been trying to do is get the mean of each element for each category. So far I've mostly been trying things around the tapply function in R:

tapply(data$element_1, data$category, mean)

This gives me nice results for one element column, but I cannot find a way to do it for all columns at once, without computing each element column by hand (mean of element_1, element_2, element_3 etc. by category).

What I want is this:

category | element_1| element_2| element_3 
     A   |   mean   |   mean   |   mean
     B   |   mean   |   mean   |   mean
     C   |   mean   |   mean   |   mean

I've tried versions of apply and aggregate, but cannot get them to work.

Any advice is appreciated. If I need to supply more information, please let me know!

If you only want to aggregate the columns, you can use the dplyr library.

library(dplyr)

# Small example data frame
df <- data.frame(sample_id = c(1, 2, 3, 4),
                 category = c("A", "B", "C", "A"),
                 element1 = c(1, 2, 3, 4),
                 element2 = c(5, 6, 7, 8),
                 element3 = c(9, 10, 11, 12))

# Mean of every numeric column (sample_id included)
summarise_if(df, is.numeric, mean)

or, equivalently:

df %>% summarise_if(is.numeric, mean)

This applies the function mean to every column that is numeric.

If you want more information than just the mean, you could look at the summary statistics.

Let's create some sample data:
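Since the question asks for means per category and mentions aggregate, here is a minimal base R sketch of that idea, using the same example data frame. The grep pattern and the names element_cols and by_category are illustrative assumptions, not part of the original answer.

```r
# Same example data as in the answer above
df <- data.frame(sample_id = c(1, 2, 3, 4),
                 category = c("A", "B", "C", "A"),
                 element1 = c(1, 2, 3, 4),
                 element2 = c(5, 6, 7, 8),
                 element3 = c(9, 10, 11, 12))

# Assumption: the columns to average are those whose names start with "element"
element_cols <- grep("^element", names(df), value = TRUE)

# One row per category, one mean per element column
by_category <- aggregate(df[element_cols],
                         by = list(category = df$category),
                         FUN = mean)
print(by_category)
```

Selecting the element columns by name keeps sample_id out of the result, which a plain is.numeric filter would not do.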

library(tidyverse)
set.seed(1)

my_data <- as_tibble(matrix(runif(100), ncol = 10,
                            dimnames = list(rows = NULL,
                                            cols = paste0("Var_", 1:10))))

Now, we can see the full summary statistics by just using summary:

summary(my_data)

# Alternatively 
my_data %>%
  summary

You can use the colMeans function from base R (the matrixStats and Rfast packages provide their own variants).

my_data %>%
  colMeans

If you only want to do it on a subset of your data, you can use the select function:

my_data %>%
  select(Var_1, Var_2) %>%
  colMeans

Note that colMeans throws an error if the data contains non-numeric columns, so select only the numeric variables first. summary still works without a problem in that case.

EDIT:

Taking your comment into account and re-reading your (updated) question, this might be closer to what you are looking for.
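To illustrate the point about non-numeric columns, here is a small base R sketch. The data frame mixed and the variable names are hypothetical and chosen only for this example.

```r
# Hypothetical data with one non-numeric column
mixed <- data.frame(category = c("A", "B", "A"),
                    Var_1 = c(1, 2, 3),
                    Var_2 = c(4, 5, 6))

# colMeans() on the full data frame fails because 'category' is character;
# tryCatch() turns the error into a marker value so the script keeps running
result <- tryCatch(colMeans(mixed), error = function(e) "error")
print(result)

# Keep only the numeric columns, then take the column means
numeric_cols <- vapply(mixed, is.numeric, logical(1))
means <- colMeans(mixed[numeric_cols])
print(means)
```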

library(tidyverse)

set.seed(1)

data <- tibble(
  sampleID = 1:6,
  category = c("A", "B", "C", "D", "B", "A"),
  element_1 = runif(6)*10,
  element_2 = runif(6)*10,
  element_3 = runif(6)*10,
  element_4 = runif(6)*10
  )

This gives a dataset that looks like this:

# A tibble: 6 x 6
  sampleID category element_1 element_2 element_3 element_4
     <int> <chr>        <dbl>     <dbl>     <dbl>     <dbl>
1        1 A             4.97     7.80       2.52      5.06
2        2 B             9.93     7.62       4.23      7.16
3        3 C             3.77     6.16       2.02      1.51
4        4 D             4.78     0.510      5.02      4.79
5        5 B             1.67     6.96       3.14      2.58
6        6 A             6.07     9.76       9.99      6.47

Now, we can just make a small change and use the group_by() function:

data %>%
  group_by(category) %>%
  summarize_if(is.numeric, mean)

This gives the desired output (note that sampleID is numeric, so it is averaged as well):

  category sampleID element_1 element_2 element_3 element_4
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 A             3.5      5.52     8.78       6.26      5.77
2 B             3.5      5.80     7.29       3.69      4.87
3 C             3        3.77     6.16       2.02      1.51
4 D             4        4.78     0.510      5.02      4.79
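If the averaged sampleID column is unwanted, one option is to pick the element columns by name instead of by type. The following is a sketch, assuming dplyr 1.0 or later (for across() and the .groups argument) and using the values from the table in the question; the name by_category is illustrative.

```r
library(dplyr)

# Data from the table in the question
data <- tibble(
  sampleID  = 1:6,
  category  = c("A", "B", "C", "D", "B", "A"),
  element_1 = c(12.53, 19.53, 21.53, 18.53, 17.53, 32.53),
  element_2 = c(46.17, 16.17, 56.17, 96.17, 76.17, 36.17),
  element_3 = c(94.09, 14.09, 24.09, 34.09, 44.09, 54.09),
  element_4 = c(25.23, 28.23, 26.23, 21.23, 24.23, 25.23)
)

# Mean of every element_* column within each category; sampleID is left out
by_category <- data %>%
  group_by(category) %>%
  summarise(across(starts_with("element_"), mean), .groups = "drop")

print(by_category)
```

In current dplyr releases across() is the recommended replacement for the summarise_if() family, although summarise_if() still works.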
