How to calculate kruskal.test for each variable across many datasets in R and save the p-values

Question

I have many datasets in my working directory.

getwd()
[1] "C:/Users/mi/Documents"

For a reproducible example, I provide only two of them.

alt=structure(list(groupter = c(1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 
1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L), screen = c(12.2, 
24.4, 13.5, 18.5, 13.9, 16.6, 12, 16.6, 13.5, 15.9, 11.5, 9.6, 
44, 22.2, 17.1, 31.2, 13.7, 39.9, 11.5, 20, 27.5, 18.5, 22.2, 
21.9, 18.3, 42.1, 16.4, 16.6, 12, 28.7, 10.3, 33.6, 10.1, 22.7, 
7.2, 28, 16.4, 13.2), vizit.1.day.2 = c(16.1, 9.8, 9.3, 21, 11.3, 
9.8, 11.3, 16.6, 15.4, 14.9, 11, 10.3, 22.7, 15.4, 33.3, 15.2, 
9.3, 32.1, 10.3, 13.9, 32.1, 14.4, 23.2, 17.1, 17.8, 27, 15.4, 
29.9, 12.2, 16.8, 9.6, 18.1, 10.5, 15.4, 13.2, 11.5, 20, 9.6), 
    vizit.2.day.9 = c(10.1, 16.4, 11.5, 21.9, 20, 12.5, 12.5, 
    13.5, 14.9, 17.1, 10.8, 11, 21.7, 14.4, 16.4, 34.5, 9.3, 
    23.6, 12, 12.5, 32.6, 11.3, 19.3, 16.4, 12.2, 30.7, 12, 28, 
    14.4, 17.1, 9.8, 22.7, 11.5, 13.2, 11.5, 10.5, 13.9, 14.9
    ), vizit.3.day.16 = c(22.7, 12.7, 22.4, 16.4, 12.2, 11, 10.8, 
    13, 13, 12.5, 9.6, 8.6, 17.8, 12.2, 13.5, 22.4, 8.4, 26.8, 
    14.4, 11.8, 72.9, 8.6, 19.5, 16.4, 14.2, 32.8, 12, 27.5, 
    9.1, 13, 9.3, 18.1, 11, 10.8, 12.7, 24.6, 13, 13.5), vizit.4.day.23 = c(23.9, 
    14, 11.2, 13.7, 21.1, 10.5, 15.6, 18.6, 13.7, 14.2, 12.4, 
    7.5, 20.9, 15.6, 13.7, 20.7, 8.2, 44, 10.7, 10.3, 32.2, 7, 
    20.2, 11.7, 29, 23.2, 10.7, 23.9, 9.8, 11.4, 9.1, 19.5, 8.7, 
    11.9, 11.7, 11.4, 20, 10.7), vizit.5.day.29 = c(13.5, 16.7, 
    15.4, 14.9, 44, 11, 14.4, 15.6, 11, 12.6, 11.4, 9.4, 26.2, 
    14, 17.4, 18.8, 10.3, 41.2, 12.6, 11.9, 28.5, 8.4, 20.7, 
    12.8, 24.1, 30.6, 13.7, 26.9, 13.5, 11.9, 10, 8.4, 10, 13, 
    12.4, 11.7, 16.3, 11.2)), class = "data.frame", row.names = c(NA, 
-38L))

ast=structure(list(groupter = c(1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 
1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L), screen = c(13.2, 
25.4, 14.5, 19.5, 14.9, 17.6, 13, 17.6, 14.5, 16.9, 12.5, 10.6, 
45, 23.2, 18.1, 32.2, 14.7, 40.9, 12.5, 21, 28.5, 19.5, 23.2, 
22.9, 19.3, 43.1, 17.4, 17.6, 13, 29.7, 11.3, 34.6, 11.1, 23.7, 
8.2, 29, 17.4, 14.2), vizit.1.day.2 = c(17.1, 10.8, 10.3, 22, 
12.3, 10.8, 12.3, 17.6, 16.4, 15.9, 12, 11.3, 23.7, 16.4, 34.3, 
16.2, 10.3, 33.1, 11.3, 14.9, 33.1, 15.4, 24.2, 18.1, 18.8, 28, 
16.4, 30.9, 13.2, 17.8, 10.6, 19.1, 11.5, 16.4, 14.2, 12.5, 21, 
10.6), vizit.2.day.9 = c(11.1, 17.4, 12.5, 22.9, 21, 13.5, 13.5, 
14.5, 15.9, 18.1, 11.8, 12, 22.7, 15.4, 17.4, 35.5, 10.3, 24.6, 
13, 13.5, 33.6, 12.3, 20.3, 17.4, 13.2, 31.7, 13, 29, 15.4, 18.1, 
10.8, 23.7, 12.5, 14.2, 12.5, 11.5, 14.9, 15.9), vizit.3.day.16 = c(23.7, 
13.7, 23.4, 17.4, 13.2, 12, 11.8, 14, 14, 13.5, 10.6, 9.6, 18.8, 
13.2, 14.5, 23.4, 9.4, 27.8, 15.4, 12.8, 73.9, 9.6, 20.5, 17.4, 
15.2, 33.8, 13, 28.5, 10.1, 14, 10.3, 19.1, 12, 11.8, 13.7, 25.6, 
14, 14.5), vizit.4.day.23 = c(24.9, 15, 12.2, 14.7, 22.1, 11.5, 
16.6, 19.6, 14.7, 15.2, 13.4, 8.5, 21.9, 16.6, 14.7, 21.7, 9.2, 
45, 11.7, 11.3, 33.2, 8, 21.2, 12.7, 30, 24.2, 11.7, 24.9, 10.8, 
12.4, 10.1, 20.5, 9.7, 12.9, 12.7, 12.4, 21, 11.7), vizit.5.day.29 = c(14.5, 
17.7, 16.4, 15.9, 45, 12, 15.4, 16.6, 12, 13.6, 12.4, 10.4, 27.2, 
15, 18.4, 19.8, 11.3, 42.2, 13.6, 12.9, 29.5, 9.4, 21.7, 13.8, 
25.1, 31.6, 14.7, 27.9, 14.5, 12.9, 11, 9.4, 11, 14, 13.4, 12.7, 
17.3, 12.2)), class = "data.frame", row.names = c(NA, -38L))

How can I run kruskal.test for each variable separately? I can do it manually:

kruskal.test(screen ~ groupter, data = alt)

then

kruskal.test(vizit.1.day.2 ~ groupter, data = alt)

and so on. But this is very inconvenient when there are many variables in the dataset. Is there a way to run the test for all variables at once and get the p-value for each variable without writing the same command a hundred times?

Also, as I said above, there are many datasets in the working directory, so I need to apply the same procedure to each dataset. The name of the grouping variable, groupter, is the same everywhere. First we take the first dataset and calculate the p-values for its variables, then we take the second dataset and calculate the p-values for its variables, and so on. How can I get the desired result below? For alt:

          vizit 1 day 2  vizit 2 day 9  vizit 3 day 16  vizit 4 day 23  vizit 5 day 29  screen
p-value   0.05           0.05           0.05            0.05            0.05            0.05

and for ast

          vizit 1 day 2  vizit 2 day 9  vizit 3 day 16  vizit 4 day 23  vizit 5 day 29  screen
p-value   0.04           0.04           0.04            0.04            0.04            0.04

and likewise for any other datasets in the working directory. Any help is appreciated, thanks.

Answer 1

You can calculate all required p-values for one data.frame (called df here, with the grouping variable in the first column) like this:

p <- unlist(lapply(2:ncol(df), function(x) {kruskal.test(df[,x] ~ df[,1])$p.value}))

If you want to save it nicely you can do:

result <- as.data.frame(matrix(p, nrow=1))
names(result) <- names(df)[-1]

> result (for ast)
      screen vizit.1.day.2 vizit.2.day.9 vizit.3.day.16 vizit.4.day.23 vizit.5.day.29
1 0.02270054     0.1836747     0.2861786      0.1480187      0.1113489      0.2427368

You could list all files in the directory with list.files() and then loop through the files. After reading, you can directly apply the above code and save the output in a list or you can create an empty matrix before starting the loop and save the created vector row by row into that matrix.
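The loop over files described above could look roughly like the following. This is only a sketch under assumptions that are not in the question: the datasets are stored as CSV files, every file has the grouping variable in its first column, and all files share the same response columns. The helper names p_values_one and p_values_dir are made up for illustration.

```r
# Hypothetical helper: Kruskal-Wallis p-value for every column except the
# first, which is assumed to hold the grouping variable.
p_values_one <- function(d) {
  sapply(d[-1], function(y) kruskal.test(y, d[[1]])$p.value)
}

# Hypothetical helper: apply p_values_one() to every CSV file in `dir`.
# Assumes the files are CSVs that all share the same column layout.
p_values_dir <- function(dir = getwd()) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  res <- lapply(files, function(f) p_values_one(read.csv(f)))
  names(res) <- basename(files)
  # Stack the named vectors: one row per file, one column per variable.
  do.call(rbind, res)
}
```

Calling p_values_dir() without arguments would use the current working directory. If the files are in another format, read.csv would need to be swapped for the matching reader.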

Answer 2

This isn't a particularly elegant solution, but has the advantage of not relying on your group column always being the first column, and allows for some flexibility in which columns are used as responses (by default, it will use any column that isn't the grouping column).

library(dplyr) # for `bind_rows`
library(broom) # for `tidy`

response_vars <- function(data, group){
  # Returns the names in `data` that are not the group name.
  names(data)[!names(data) %in% group]
}

kruskal_tests <- function(data, 
                          group,
                          response = response_vars(data, group),
                          ...){
  require(broom)
  if (length(group) != 1){
    # this isn't written to work with multiple grouping variables supplied.
    # stop the function if necessary.
    stop("`group` must have length 1")
  }
  
  # Create a list of formulas for the `kruskal.test` function.
  formula_list <- 
    lapply(response, 
           function(r){
             as.formula(sprintf("%s ~ %s", r, group))
           })
  
  # Run `kruskal.test` for each formula
  kruskal_result <- 
    lapply(formula_list, 
           kruskal.test,
           data = data, 
           ...)
  
  # Convert the results to a data frame
  kruskal_result <- lapply(kruskal_result, 
                           tidy)
  
  # Compile all of the results from each test into a data frame.
  kruskal_result <- do.call("rbind", kruskal_result)
  
  # Add the response variable to the data frame.
  kruskal_result$response <- response
  
  kruskal_result
}

An example of running this on a single data frame:

kruskal_tests(alt, "groupter")

To do this for multiple data frames, we want to have our data frames in a named list. The names matter because they let us recall which results came from which data frame, in case the data frames have similar columns.

How you get them into a named list may vary. For instance, if you use lapply to read in the files, you could use the filename as the names for the list. I'll use mget after having loaded your ast and alt data frames.

# Puts the data frames in a named list.
# The named list is key to getting the .id argument in `bind_rows` to 
#    behave the way we want.
data_list <- mget(c("alt", "ast"))


lapply(data_list, 
       kruskal_tests, 
       group = "groupter") %>% 
  bind_rows(.id = "dataset")
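To get from this long result to the layout asked for in the question (one row per dataset, one column per variable), the p-values could be pivoted. Below is a minimal base R sketch. It assumes the combined result has columns named dataset, response and p.value, which is what bind_rows(.id = "dataset") and tidy should produce here. The helper name to_wide is hypothetical.

```r
# Hypothetical helper: pivot a long result (one row per dataset/response pair)
# into a matrix with one row per dataset and one column per response variable.
# Assumes columns named `dataset`, `response` and `p.value`.
to_wide <- function(long) {
  tapply(long$p.value,
         list(long$dataset, long$response),
         function(p) p[1])  # each cell is expected to hold a single p-value
}
```

If the combined result from the pipeline above is stored in a variable, say res, it could be passed as to_wide(res). Columns come out in alphabetical order of the response names.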
