
How Do I Create an Array in R from a Data Frame with 3 Columns?

I currently have a dataframe with three columns, as recreated below:

SNPname             AnimalID  AlleleFrequency
ARS-BFGL-BAC-10172  1         0.0
ARS-BFGL-BAC-1020   2         0.5
ARS-BFGL-BAC-10345  3         1.0
ARS-BFGL-BAC-10591  4         0.5
and so on...        and so on...  and so on...

For each animal, I have ~777,000 SNPs and their corresponding allele frequencies. (To be exact, I have 777,962 SNPs on 52 animals, for a total of 40,454,024 observations.)

Basically, I need to create an array from this data so that the rows are the SNPs, the single column is the allele frequency, and the third dimension is the AnimalID. So in total, I need the dimensions to be [777962 1 52]. However, for the life of me, I cannot figure out how to make this array. I've tried the array command and the abind command, among a few other things out of desperation, but I have not had any luck.

This is the code that was originally suggested to me by a friend who knows more about R than I do:

array = abind(df, along = 3)

but that gives me an array with these dimensions: [40454024 2 1] which isn't right.

Here are some other things I've tried that haven't worked:

array = array(data = df$`SNPname`, df$AlleleFrequency, df$`AnimalID`)
array = abind(data = df$`SNPname`, df$AlleleFrequency, df$`AnimalID`)
array = array(c(df$`SNPname`, df$AlleleFrequency), dim =c(df$`SNPname`, df$AlleleFrequency, df$`AnimalID`))

If someone could help point me in the right direction, I would be eternally grateful. Thanks in advance!!

If you mean you need a 3D array with the three columns as its dimensions, then each cell will hold a count. For this, use xtabs (or table):

xtabs(~SNPname + AlleleFrequency + AnimalID, data = dat)
# , , AnimalID = 1
#                     AlleleFrequency
# SNPname              0 0.5 1
#   ARS-BFGL-BAC-10172 1   0 0
#   ARS-BFGL-BAC-1020  0   0 0
#   ARS-BFGL-BAC-10345 0   0 0
#   ARS-BFGL-BAC-10591 0   0 0
# , , AnimalID = 2
#                     AlleleFrequency
# SNPname              0 0.5 1
#   ARS-BFGL-BAC-10172 0   0 0
#   ARS-BFGL-BAC-1020  0   1 0
#   ARS-BFGL-BAC-10345 0   0 0
#   ARS-BFGL-BAC-10591 0   0 0
# , , AnimalID = 3
#                     AlleleFrequency
# SNPname              0 0.5 1
#   ARS-BFGL-BAC-10172 0   0 0
#   ARS-BFGL-BAC-1020  0   0 0
#   ARS-BFGL-BAC-10345 0   0 1
#   ARS-BFGL-BAC-10591 0   0 0
# , , AnimalID = 4
#                     AlleleFrequency
# SNPname              0 0.5 1
#   ARS-BFGL-BAC-10172 0   0 0
#   ARS-BFGL-BAC-1020  0   0 0
#   ARS-BFGL-BAC-10345 0   0 0
#   ARS-BFGL-BAC-10591 0   1 0
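
To make the "each cell is a count" point concrete, here is a small sketch that inspects the shape of that result. It rebuilds the same four-row `dat` so it runs on its own; the name `tab` is just an illustrative choice.

```r
# Same four rows as the `dat` object in the Data section
dat <- data.frame(
  SNPname = c("ARS-BFGL-BAC-10172", "ARS-BFGL-BAC-1020",
              "ARS-BFGL-BAC-10345", "ARS-BFGL-BAC-10591"),
  AnimalID = 1:4,
  AlleleFrequency = c(0, 0.5, 1, 0.5)
)

# 3D contingency table: one dimension per column
tab <- xtabs(~ SNPname + AlleleFrequency + AnimalID, data = dat)

dim(tab)                              # SNPs x distinct frequencies x animals
tab["ARS-BFGL-BAC-1020", "0.5", "2"]  # count for one (SNP, frequency, animal) cell
sum(tab) == nrow(dat)                 # every input row is counted exactly once
```

Note that the second dimension has one slot per distinct frequency value, not a single "AlleleFrequency" column, so this is probably not the [777962 1 52] shape asked for in the question.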

If you mean that you need the frequency to be the value in each cell rather than a count, then you can still create a 3D array for it, but it will never hold more than two dimensions of data, because each SNP/animal pair has a single frequency. One way to get this is with tidyr::pivot_wider:

tidyr::pivot_wider(dat, id_cols = "SNPname", names_from = "AnimalID", values_from = "AlleleFrequency")
# # A tibble: 4 x 5
#   SNPname              `1`   `2`   `3`   `4`
#   <chr>              <dbl> <dbl> <dbl> <dbl>
# 1 ARS-BFGL-BAC-10172     0  NA      NA  NA  
# 2 ARS-BFGL-BAC-1020     NA   0.5    NA  NA  
# 3 ARS-BFGL-BAC-10345    NA  NA       1  NA  
# 4 ARS-BFGL-BAC-10591    NA  NA      NA   0.5
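
If a literal SNPs x 1 x animals array is still wanted, one possible base-R route is to build the SNP-by-animal matrix first and then give it a length-1 middle dimension. The sketch below is only an illustration under assumptions: the `toy` data frame and its `snp_*` names are made up so that every SNP appears for every animal (as in the real data, unlike the four-row sample), and each SNP/animal pair is assumed to occur at most once.

```r
# Hypothetical toy data: every SNP observed on every animal
toy <- expand.grid(
  SNPname = c("snp_a", "snp_b", "snp_c"),
  AnimalID = 1:2,
  stringsAsFactors = FALSE
)
toy$AlleleFrequency <- c(0, 0.5, 1, 0.5, 0.5, 0)

# SNP-by-animal matrix of frequencies; missing pairs would come out as NA.
# FUN takes the first value, assuming one row per SNP/animal pair.
m <- tapply(toy$AlleleFrequency,
            list(toy$SNPname, toy$AnimalID),
            FUN = function(x) x[1])

# Reshape to SNPs x 1 x animals, keeping the labels
arr <- array(m,
             dim = c(nrow(m), 1, ncol(m)),
             dimnames = list(rownames(m), "AlleleFrequency", colnames(m)))

dim(arr)
arr["snp_c", "AlleleFrequency", "1"]
```

Because the middle dimension has length 1, `arr[, 1, ]` gives back the same 2D matrix, which is why the wide table above already carries all of the information.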

Data

dat <- structure(
  list(
    SNPname = c("ARS-BFGL-BAC-10172", "ARS-BFGL-BAC-1020",
                "ARS-BFGL-BAC-10345", "ARS-BFGL-BAC-10591"),
    AnimalID = 1:4,
    AlleleFrequency = c(0, 0.5, 1, 0.5)
  ),
  class = "data.frame",
  row.names = c(NA, -4L)
)
