Complex code for many fisher.test p-values in R

I am a beginner in R, so the following is highly complex for me.

I have the following data.frame with data from the 5 boroughs of New York City for the years 2012-2015. For each year, there are two categories: P and Q.

Data

 input_df = data.frame(
      Manhattan=c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0), 
      Brooklyn=c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0), 
      Queens=c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0), 
      The_Bronx=c(1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0), 
      Staten_Island=c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0), 
      "2012"=c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q"), 
      "2013"=c("P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q"), 
      "2014"=c("P", "P", "P", "Q", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "P", "Q", "P", "P", "P", "Q", "Q"), 
      "2015"=c("P", "P", "P", "P", "P", "Q", "Q", "Q", "P", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q"), 
 check.names=FALSE)

I would like to systematically determine whether, in any two boroughs, incidents ("1") occurred at the same time more frequently for category P than for category Q (or vice versa), using fisher.test.

So, for example: in 2012, did incidents in Manhattan and Brooklyn occur at the same time (both "1" in the same row) more frequently in category P than in category Q? This is 4 out of 10 for P and 0 out of 9 for Q, so fisher.test(matrix(c(4,6,0,9), nrow=2))$p.value equals 0.08668731.
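The counts in this example can be derived from the data rather than typed in by hand. The following is a minimal sketch using only the Manhattan, Brooklyn and 2012 columns from the question; the variable names (manhattan, brooklyn, cat_2012, both, counts) are chosen here for illustration.

```r
# Manhattan, Brooklyn and the 2012 category, copied from the question's data
manhattan <- c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
brooklyn  <- c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0)
cat_2012  <- rep(c("P", "Q"), times = c(10, 9))

# TRUE where both boroughs have an incident in the same row
both <- manhattan == 1 & brooklyn == 1

# 2 x 2 table: rows are both = TRUE/FALSE, columns are category P/Q.
# Fixing the factor levels keeps the TRUE row even if it has no entries.
counts <- table(both = factor(both, levels = c(TRUE, FALSE)),
                category = cat_2012)
counts

p_value <- fisher.test(counts)$p.value
p_value
```

The table holds the same four numbers as matrix(c(4,6,0,9), nrow=2), so the p-value agrees with the one quoted above.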

Is there a way to do this systematically? See below for a simple start and my ideal output data.frame. I would be happy with anything even close to this output. Thank you.

Code (only a start)

 library(reshape2)
 input_df <- melt(input_df, measure.vars = 6:9) # transform the data
 # can maybe use: function x {fisher.test(matrix(x, nrow=2))}
 # how to proceed?

Ideal output

 # ideally hoping to get output similar to this:
 output_df = data.frame(
 borough_1=c("Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan", "etc"), 
 borough_2=c("Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn", "Queens", "Queens", "etc"),
 year=c("2012", "2013", "2014", "2015", "2012", "2013", "etc"), 
 P_both_boroughs_1=c("4", "2", "1", "2", "4", "4", "etc"), 
 P_not_both_boroughs_1=c("6", "11", "8", "6", "6", "8", "etc"), 
 Q_both_boroughs_1=c("0", "2", "3", "2", "1", "1", "etc"), 
 Q_not_both_boroughs_1=c("9", "5", "7", "9", "8", "6", "etc"), 
 fisher.test.pval=c("0.086687307", "0.586790506", "0.582043344", "1", "0.303405573", "0.602683179", "etc"), 
 check.names=FALSE)

I would solve this problem as follows. First of all, I load the packages I'll use for the analysis

# packages
library(dplyr)
library(tidyr)
library(purrr)

and create the dataset.

# data
input_df <- tibble(
  Manhattan = c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0),
  Brooklyn = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0),
  Queens = c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0),
  The_Bronx = c(1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
  Staten_Island = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  "2012" = c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q"),
  "2013" = c("P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q"),
  "2014" = c("P", "P", "P", "Q", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "P", "Q", "P", "P", "P", "Q", "Q"),
  "2015" = c("P", "P", "P", "P", "P", "Q", "Q", "Q", "P", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q")
)
head(input_df)
#> # A tibble: 6 x 9
#>   Manhattan Brooklyn Queens The_Bronx Staten_Island `2012` `2013` `2014`
#>       <dbl>    <dbl>  <dbl>     <dbl>         <dbl> <chr>  <chr>  <chr> 
#> 1         1        0      1         1             0 P      P      P     
#> 2         1        0      1         1             0 P      P      P     
#> 3         0        0      0         0             0 P      P      P     
#> 4         1        1      0         0             0 P      P      Q     
#> 5         1        0      1         0             0 P      P      Q     
#> 6         1        1      1         0             0 P      P      P     
#> # ... with 1 more variable: `2015` <chr>

Then I change your dataset from a wide structure to a long structure. The columns year and borough take the values 2012, ..., 2015 and Manhattan, ..., Staten_Island, while the columns category and flag take the corresponding values for that combination of borough and year in your dataset. I need this structure for the subsequent functions.

# tidying
tidy_input_df <- input_df %>%
  gather("year", "category", `2012`:`2015`) %>%
  gather("borough", "flag", -category, -year)
tidy_input_df
#> # A tibble: 380 x 4
#>    year  category borough    flag
#>    <chr> <chr>    <chr>     <dbl>
#>  1 2012  P        Manhattan     1
#>  2 2012  P        Manhattan     1
#>  3 2012  P        Manhattan     0
#>  4 2012  P        Manhattan     1
#>  5 2012  P        Manhattan     1
#>  6 2012  P        Manhattan     1
#>  7 2012  P        Manhattan     1
#>  8 2012  P        Manhattan     0
#>  9 2012  P        Manhattan     1
#> 10 2012  P        Manhattan     1
#> # ... with 370 more rows
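In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The sketch below shows the same wide-to-long reshaping on a small made-up tibble (toy, with two boroughs and two years), assuming a tidyr version that provides pivot_longer(). The rows come out in a different order than with gather(), but the four resulting columns are the same.

```r
library(dplyr)
library(tidyr)

# Made-up miniature of the question's data: 2 boroughs, 2 years, 2 rows
toy <- tibble(
  Manhattan = c(1, 0),
  Brooklyn  = c(0, 1),
  `2012`    = c("P", "Q"),
  `2013`    = c("Q", "P")
)

tidy_toy <- toy %>%
  # year columns -> (year, category) pairs
  pivot_longer(cols = c(`2012`, `2013`),
               names_to = "year", values_to = "category") %>%
  # borough columns -> (borough, flag) pairs
  pivot_longer(cols = c(Manhattan, Brooklyn),
               names_to = "borough", values_to = "flag")

tidy_toy
```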

I'll also need a vector containing the names of all boroughs

borough <- unique(tidy_input_df$borough)

Now I have to modify your dataset so that, for every year, there are two columns containing every possible pair of boroughs (i.e. Manhattan - Brooklyn, Manhattan - Queens and so on) together with the corresponding values. Since I need to repeat the same procedure for each year, I nest the data by year

nested_input_df <- nest(tidy_input_df, -year)
nested_input_df
#> # A tibble: 4 x 2
#>   year  data             
#>   <chr> <list>           
#> 1 2012  <tibble [95 x 3]>
#> 2 2013  <tibble [95 x 3]>
#> 3 2014  <tibble [95 x 3]>
#> 4 2015  <tibble [95 x 3]>

and create a new function which performs the procedure I described above. I can now use the nest-map approach: nest the data, then map a function over each nested piece.
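For readers new to the nest-map pattern, here is a minimal sketch on made-up data (the toy tibble and the n_incidents column are illustrative, not part of the original answer). It uses the named form nest(data = -year), which assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Made-up data: a few 0/1 flags for two years
toy <- tibble(
  year = rep(c("2012", "2013"), each = 3),
  flag = c(1, 0, 1, 0, 0, 1)
)

# Step 1: nest, giving one row per year with the remaining columns
# stored as a tibble inside the list-column `data`
nested <- toy %>% nest(data = -year)

# Step 2: map a function over each nested tibble, one call per year
incidents_per_year <- nested %>%
  mutate(n_incidents = map_dbl(data, ~ sum(.x$flag))) %>%
  select(-data)

incidents_per_year
```

The same two steps are used in the rest of this answer, only with larger functions in place of the sum.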

The first part of the function creates a new column, ID, that numbers the rows within each combination of category and borough, so the same ID refers to the same original row in every borough. The second part creates a new data frame with all combinations of boroughs taken 2 at a time and associates the corresponding values of flag and category (i.e. 0/1 and P/Q) with each pair.

create_boroughs_combinations <- function(data, borough) {
  # Number the rows within each (category, borough) group, so that the same
  # ID identifies the same original row across boroughs
  data <- data %>%
    group_by(category, borough) %>%
    mutate(ID = 1:n()) %>%
    ungroup()

  # Create all combinations of n boroughs taken 2 at a time.
  t(combn(length(borough), 2)) %>%
  # transform that matrix into a tibble
    as_tibble(.name_repair = ~ c("borough_1", "borough_2")) %>%
  # associate each matrix value to the corresponding borough name
    mutate(borough_1 = borough[borough_1], borough_2 = borough[borough_2]) %>%
  # join the two dataframes wrt the name of the first borough
    inner_join(data, by = c("borough_1" = "borough")) %>%
  # join the two dataframes wrt the name of the second borough, the category
  # and the row ID
    inner_join(data, by = c("borough_2" = "borough", "category", "ID")) %>%
  # create a new variable that checks if the incidents occurred at the same time
    mutate(equal = factor(flag.x == 1 & flag.y == 1, levels = c(TRUE, FALSE)))
}
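The pair-enumeration step inside the function can be tried in isolation. This sketch calls combn() directly on the borough names instead of on their indices, which is a small variation on the code above and not what the function itself does.

```r
borough <- c("Manhattan", "Brooklyn", "Queens", "The_Bronx", "Staten_Island")

# combn() returns one pair per column; transpose to get one pair per row
borough_pairs <- t(combn(borough, 2))
colnames(borough_pairs) <- c("borough_1", "borough_2")

nrow(borough_pairs)   # choose(5, 2) pairs
head(borough_pairs, 3)
```

Each unordered pair appears once, so Manhattan - Brooklyn is listed but Brooklyn - Manhattan is not.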

Now I can apply that function to nested_input_df using the map function. I have to use map since I need to apply that function separately for each year. This is the result. flag.x is the value of flag for the first borough while flag.y is the value of flag for the second borough.

unnested_input_df <- nested_input_df %>%
  mutate(data = map(data, create_boroughs_combinations, borough = borough)) %>%
  # name the list-column explicitly; a bare unnest() warns in tidyr >= 1.0
  unnest(data)
unnested_input_df
#> # A tibble: 760 x 8
#>    year  borough_1 borough_2 category flag.x    ID flag.y equal
#>    <chr> <chr>     <chr>     <chr>     <dbl> <int>  <dbl> <fct>
#>  1 2012  Manhattan Brooklyn  P             1     1      0 FALSE
#>  2 2012  Manhattan Brooklyn  P             1     2      0 FALSE
#>  3 2012  Manhattan Brooklyn  P             0     3      0 FALSE
#>  4 2012  Manhattan Brooklyn  P             1     4      1 TRUE 
#>  5 2012  Manhattan Brooklyn  P             1     5      0 FALSE
#>  6 2012  Manhattan Brooklyn  P             1     6      1 TRUE 
#>  7 2012  Manhattan Brooklyn  P             1     7      0 FALSE
#>  8 2012  Manhattan Brooklyn  P             0     8      0 FALSE
#>  9 2012  Manhattan Brooklyn  P             1     9      1 TRUE 
#> 10 2012  Manhattan Brooklyn  P             1    10      1 TRUE 
#> # ... with 750 more rows

Now I can use the same idea: create a new function which computes the p-value of the Fisher test and apply it to every combination of year and pair of boroughs. I nest my data again:

nested_input_df <- unnested_input_df %>%
  nest(-year, -borough_1, -borough_2)
nested_input_df
#> # A tibble: 40 x 4
#>    year  borough_1 borough_2     data             
#>    <chr> <chr>     <chr>         <list>           
#>  1 2012  Manhattan Brooklyn      <tibble [19 x 5]>
#>  2 2012  Manhattan Queens        <tibble [19 x 5]>
#>  3 2012  Manhattan The_Bronx     <tibble [19 x 5]>
#>  4 2012  Manhattan Staten_Island <tibble [19 x 5]>
#>  5 2012  Brooklyn  Queens        <tibble [19 x 5]>
#>  6 2012  Brooklyn  The_Bronx     <tibble [19 x 5]>
#>  7 2012  Brooklyn  Staten_Island <tibble [19 x 5]>
#>  8 2012  Queens    The_Bronx     <tibble [19 x 5]>
#>  9 2012  Queens    Staten_Island <tibble [19 x 5]>
#> 10 2012  The_Bronx Staten_Island <tibble [19 x 5]>
#> # ... with 30 more rows

define the function:

run_fisher_test <- function(data) {
  data <- data %>%
    select(category, equal)

  fisher.test(table(data))$p.value
}
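This function relies on equal being a factor with both levels fixed, as set up in create_boroughs_combinations. A small sketch with made-up values shows the difference: when no row is TRUE, a plain logical vector produces a one-column table, which fisher.test would reject for having fewer than 2 columns, while the factor still gives a 2 x 2 table.

```r
# Made-up pair of boroughs that never have an incident in the same row
category  <- rep(c("P", "Q"), times = c(3, 2))
equal_lgl <- rep(FALSE, 5)

# Plain logical: only the observed value FALSE becomes a column
dim_logical <- dim(table(category, equal_lgl))

# Factor with fixed levels: the empty TRUE column is kept
equal_fct  <- factor(equal_lgl, levels = c(TRUE, FALSE))
dim_factor <- dim(table(category, equal_fct))

dim_logical
dim_factor
```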

apply it and this is the result:

result <- nested_input_df %>%
  mutate(p.value = map_dbl(data, run_fisher_test)) %>%
  select(-data)
result
#> # A tibble: 40 x 4
#>    year  borough_1 borough_2     p.value
#>    <chr> <chr>     <chr>           <dbl>
#>  1 2012  Manhattan Brooklyn       0.0867
#>  2 2012  Manhattan Queens         0.303 
#>  3 2012  Manhattan The_Bronx      0.303 
#>  4 2012  Manhattan Staten_Island  1     
#>  5 2012  Brooklyn  Queens         1     
#>  6 2012  Brooklyn  The_Bronx      1     
#>  7 2012  Brooklyn  Staten_Island  1     
#>  8 2012  Queens    The_Bronx      0.350 
#>  9 2012  Queens    Staten_Island  1     
#> 10 2012  The_Bronx Staten_Island  1     
#> # ... with 30 more rows
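The ideal output in the question also lists the four counts behind each p-value. One possible way to get them, sketched here with a hypothetical helper summarise_pair (not part of the original answer), is to return a one-row tibble instead of a single number. The example input is the Manhattan - Brooklyn piece for 2012, rebuilt by hand from the values printed above.

```r
library(dplyr)

# One nested piece, rebuilt by hand: Manhattan - Brooklyn, 2012
pair_data <- tibble(
  category = rep(c("P", "Q"), times = c(10, 9)),
  equal = factor(
    c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE,
      rep(FALSE, 9)),
    levels = c(TRUE, FALSE)
  )
)

# Hypothetical helper: counts and p-value for one year / borough pair
summarise_pair <- function(data) {
  counts <- table(data$category, data$equal)
  tibble(
    P_both     = counts["P", "TRUE"],
    P_not_both = counts["P", "FALSE"],
    Q_both     = counts["Q", "TRUE"],
    Q_not_both = counts["Q", "FALSE"],
    p.value    = fisher.test(counts)$p.value
  )
}

pair_summary <- summarise_pair(pair_data)
pair_summary
```

Mapping this helper over the data column with map() and then unnesting the resulting list-column should give one row per year and pair with all five values.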

Created on 2019-09-10 by the reprex package (v0.3.0)

I hope it's clear. Comment on this post if anything is unclear. I know this is not the easiest approach, but I really like the nest-map approach and it's quite flexible once you understand it.

Here is my attempt using for loops (run on the original input_df data.frame from the question, before any reshaping)

res=vector("list",4)
names(res)=colnames(input_df)[6:9]
for (k in 1:4) { #years (columns 6 to 9)
  res[[k]]=matrix(NA,5,5)
  rownames(res[[k]])=colnames(res[[k]])=colnames(input_df)[1:5]
  for (i in 1:4) { #first in pair
    for (j in (i+1):5) { #second in pair
      tmp1=which(input_df[,k+5]=="P")
      tmp2=which(input_df[,k+5]=="Q")
      tmp3=table(input_df[tmp1,i],input_df[tmp1,j]) #table for P
      tmp4=table(input_df[tmp2,i],input_df[tmp2,j]) #table for Q
      # [2,2] is the count of rows where both boroughs are 1; this assumes
      # both 0 and 1 occur for each borough within each category
      tmp5=matrix(c(tmp3[2,2],sum(tmp3)-tmp3[2,2],
                    tmp4[2,2],sum(tmp4)-tmp4[2,2]),nrow=2,byrow=TRUE)
      res[[k]][i,j]=fisher.test(tmp5)$p.value
    }
  }
}
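Indexing the tables with [2,2] works for this data set, but it fails if a borough has no incident at all within one category, because the table then has only one row or column. A sketch of an alternative that counts the "both 1" rows with a logical sum is shown below on made-up vectors (a, b and category are illustrative, not taken from the question).

```r
# Made-up data: borough `a` has no incident among the P rows
a        <- c(0, 0, 0, 1, 1)
b        <- c(1, 0, 1, 1, 0)
category <- c("P", "P", "P", "Q", "Q")
is_p     <- category == "P"

# The table for P has a single row, so [2, 2] is out of bounds
tab_p        <- table(a[is_p], b[is_p])
table_failed <- inherits(try(tab_p[2, 2], silent = TRUE), "try-error")

# Counting with a logical sum works regardless of which values occur
both_p <- sum(a[is_p] == 1 & b[is_p] == 1)
both_q <- sum(a[!is_p] == 1 & b[!is_p] == 1)
counts <- matrix(c(both_p, sum(is_p) - both_p,
                   both_q, sum(!is_p) - both_q),
                 nrow = 2, byrow = TRUE)

table_failed
counts
```

The counts matrix has the same layout as tmp5, so it can be passed to fisher.test in the same way.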

and the output with all the p-values

res

$`2012`
              Manhattan   Brooklyn    Queens The_Bronx Staten_Island
Manhattan            NA 0.08668731 0.3034056 0.3034056             1
Brooklyn             NA         NA 1.0000000 1.0000000             1
Queens               NA         NA        NA 0.3498452             1
The_Bronx            NA         NA        NA        NA             1
Staten_Island        NA         NA        NA        NA            NA

$`2013`
              Manhattan  Brooklyn    Queens  The_Bronx Staten_Island
Manhattan            NA 0.6026832 0.6026832 0.30469556     0.3684211
Brooklyn             NA        NA 1.0000000 0.03611971     0.3684211
Queens               NA        NA        NA 1.00000000     1.0000000
The_Bronx            NA        NA        NA         NA     0.1228070
Staten_Island        NA        NA        NA         NA            NA

$`2014`
              Manhattan  Brooklyn    Queens The_Bronx Staten_Island
Manhattan            NA 0.5820433 0.1408669 0.6284830             1
Brooklyn             NA        NA 0.2105263 1.0000000             1
Queens               NA        NA        NA 0.3498452             1
The_Bronx            NA        NA        NA        NA             1
Staten_Island        NA        NA        NA        NA            NA

$`2015`
              Manhattan Brooklyn    Queens The_Bronx Staten_Island
Manhattan            NA        1 0.6026832 0.6026832     0.4210526
Brooklyn             NA       NA 0.4853801 1.0000000     0.4210526
Queens               NA       NA        NA 0.3188854     1.0000000
The_Bronx            NA       NA        NA        NA     1.0000000
Staten_Island        NA       NA        NA        NA            NA

Alternatively, if you want it all in one data frame with additional information

res=matrix(NA,4*choose(5,2),8)
colnames(res)=c("borough_1","borough_2","year","P_both_boroughs_1",
                "P_not_both_boroughs_1","Q_both_boroughs_1",
                "Q_not_both_boroughs_1","fisher.test.pval")
m=1
for (k in 1:4) { #years
  for (i in 1:4) { #first in pair
    for (j in (i+1):5) { #second in pair
      tmp1=which(input_df[,k+5]=="P")
      tmp2=which(input_df[,k+5]=="Q")
      tmp3=table(input_df[tmp1,i],input_df[tmp1,j]) #table for P
      tmp4=table(input_df[tmp2,i],input_df[tmp2,j]) #table for Q
      tmp5=matrix(c(tmp3[2,2],sum(tmp3)-tmp3[2,2],
                    tmp4[2,2],sum(tmp4)-tmp4[2,2]),nrow=2,byrow=TRUE)
      res[m,]=c(colnames(input_df)[i],
                colnames(input_df)[j],
                colnames(input_df)[k+5],
                tmp5[1,1],tmp5[1,2],tmp5[2,1],tmp5[2,2],
                fisher.test(tmp5)$p.value)
      m=m+1
    }
  }
}

and the first few rows of output

data.frame(res)

  borough_1     borough_2 year P_both_boroughs_1 P_not_both_boroughs_1
1 Manhattan      Brooklyn 2012                 4                     6
2 Manhattan        Queens 2012                 4                     6
3 Manhattan     The_Bronx 2012                 4                     6
4 Manhattan Staten_Island 2012                 1                     9
5  Brooklyn        Queens 2012                 1                     9
6  Brooklyn     The_Bronx 2012                 2                     8
  Q_both_boroughs_1 Q_not_both_boroughs_1   fisher.test.pval
1                 0                     9 0.0866873065015479
2                 1                     8  0.303405572755418
3                 1                     8  0.303405572755418
4                 0                     9                  1
5                 1                     8                  1
6                 1                     8                  1
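Because res is filled with c() calls that mix names and numbers, it is a character matrix, so the counts and p-values in data.frame(res) are stored as text. A sketch of one way to convert them, assuming an R version in which type.convert() accepts a data frame, is shown here on the first row of the output above.

```r
# First row of the result, stored as text as in the loop above
res <- matrix(
  c("Manhattan", "Brooklyn", "2012", "4", "6", "0", "9", "0.0866873065015479"),
  nrow = 1,
  dimnames = list(NULL, c("borough_1", "borough_2", "year",
                          "P_both_boroughs_1", "P_not_both_boroughs_1",
                          "Q_both_boroughs_1", "Q_not_both_boroughs_1",
                          "fisher.test.pval"))
)

# Convert each column to its natural type; as.is = TRUE keeps the
# borough names as character instead of turning them into factors
res_df <- type.convert(data.frame(res, stringsAsFactors = FALSE), as.is = TRUE)
str(res_df)
```

After the conversion the p-value column can be sorted or filtered numerically, for example with res_df[res_df$fisher.test.pval < 0.05, ].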


 