Extracting data (or reshaping) a data frame from an existing data frame in R

I have a large data frame that I'm working with; the first few lines are as follows:

      Assay   Genotype   Sample    Result
1     001        G         1         0
2     001        A         2         1
3     001        G         3         0 
4     001        NA        4         NA
5     002        T         1         0
6     002        G         2         1
7     002        T         3         0 
8     002        T         4         0
9     003        NA        1         N
10    003        G         2         1
11    003        G         3         1 
12    003        T         4         0

In total I'll be working with 2000 samples and 168 assays per sample. For each sample, I'd like to extract the values in 'Result' to create either a list or a data frame that looks something like this:

Sample  Data
   1    00N
   2    111
   3    001
   4    N00

The resulting data frame (or similar preferred data structure) would thus have 2000 rows and 2 columns. Each 'Data' entry would contain 168 characters, one for each 'Assay'.

Can somebody help me with this problem?

Base R solution using split and sapply:

sapply(split(dat$Result, dat$Sample), paste, collapse="")

     1      2      3      4 
 "00N"  "111"  "001" "NA00" 
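
To reproduce the output above, here is a minimal sketch that rebuilds the twelve example rows as dat. The column types are an assumption (the question does not show them): Assay and Result are taken to be character vectors, and the missing Result in row 4 is a real NA, which paste turns into the string "NA".

```r
# Rebuild the twelve example rows from the question.
# Assumption: Assay and Result are character columns.
dat <- data.frame(
  Assay    = rep(c("001", "002", "003"), each = 4),
  Genotype = c("G", "A", "G", NA, "T", "G", "T", "T", NA, "G", "G", "T"),
  Sample   = rep(1:4, times = 3),
  Result   = c("0", "1", "0", NA, "0", "1", "0", "0", "N", "1", "1", "0"),
  stringsAsFactors = FALSE
)

# Split Result by Sample, then paste each group into one string.
# The strings follow the row order of dat within each sample.
res <- sapply(split(dat$Result, dat$Sample), paste, collapse = "")
print(res)
```

The NA in sample 4 shows up as the two characters "NA", which is why that string is four characters long instead of three.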

One approach with package plyr and base function paste:

library(plyr)
ddply(dat, "Sample", summarize, Data = paste(Result, collapse = ""))

  Sample Data
1      1  00N
2      2  111
3      3  001
4      4 NA00

EDIT to address question

Probably the easiest way I can think of to change your NA to N is to use gsub on the result of ddply. Note that I'm liberally borrowing the very good point made by @Brian about ordering. Do that, it's a good tip!

out <- ddply(dat, "Sample", summarize, Data = paste(Result[order(Assay)], collapse = ""))

Then use gsub

out$Data <- gsub("NA", "N", out$Data)

Et voilà:

  Sample Data
1      1  00N
2      2  111
3      3  001
4      4  N00
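
As an alternative to patching the strings with gsub afterwards, the missing values can be replaced before pasting. The following is a rough sketch in base R only (aggregate is swapped in for ddply here, so it is not the approach above, just an equivalent one); the data frame is rebuilt from the question's example and the character column types are an assumption.

```r
# Example rows from the question; column types are an assumption.
dat <- data.frame(
  Assay    = rep(c("001", "002", "003"), each = 4),
  Genotype = c("G", "A", "G", NA, "T", "G", "T", "T", NA, "G", "G", "T"),
  Sample   = rep(1:4, times = 3),
  Result   = c("0", "1", "0", NA, "0", "1", "0", "0", "N", "1", "1", "0"),
  stringsAsFactors = FALSE
)

# Sort so that results are pasted in assay order within each sample.
dat <- dat[order(dat$Sample, dat$Assay), ]

# Replace missing results with "N" before pasting, so no gsub is needed.
dat$Result[is.na(dat$Result)] <- "N"

# One row per sample; extra arguments (collapse) are passed on to paste.
out <- aggregate(Result ~ Sample, data = dat, FUN = paste, collapse = "")
names(out)[2] <- "Data"
print(out)
```

Replacing the NA values first also keeps every string at exactly one character per assay, which makes it easy to check that each 'Data' entry has the expected length with nchar.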

Note that @Chase and @Andrie both assume that the data is already sorted by assay (which your example is, so not an unreasonable assumption). If it is not, you can still get the string in the proper order.

Adapting @Chase's solution

library(plyr)
ddply(dat, "Sample", summarize, 
  Data = paste(Result[order(Assay)], collapse = ""))

gives

  Sample Data
1      1  00N
2      2  111
3      3  001
4      4 NA00

If we use data which is not sorted:

dat.scramble <- dat[sample(nrow(dat)),]

> dat.scramble
   Assay Genotype Sample Result
6    002        G      2      1
1    001        G      1      0
3    001        G      3      0
7    002        T      3      0
10   003        G      2      1
8    002        T      4      0
12   003        T      4      0
5    002        T      1      0
2    001        A      2      1
4    001       NA      4     NA
9    003       NA      1      N
11   003        G      3      1

we still get the same result:

ddply(dat.scramble, "Sample", summarize, 
  Data = paste(Result[order(Assay)], collapse = ""))

  Sample Data
1      1  00N
2      2  111
3      3  001
4      4 NA00
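
The claim that row order does not matter can be checked programmatically. This is a small sketch in base R, not the plyr call used above; collapse_results is a hypothetical helper name introduced only for illustration, and the data frame is rebuilt from the question's example with assumed character columns.

```r
# Example rows from the question; column types are an assumption.
dat <- data.frame(
  Assay    = rep(c("001", "002", "003"), each = 4),
  Genotype = c("G", "A", "G", NA, "T", "G", "T", "T", NA, "G", "G", "T"),
  Sample   = rep(1:4, times = 3),
  Result   = c("0", "1", "0", NA, "0", "1", "0", "0", "N", "1", "1", "0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: order by Sample and Assay, then paste per sample.
collapse_results <- function(d) {
  d <- d[order(d$Sample, d$Assay), ]
  vapply(split(d$Result, d$Sample), paste, character(1), collapse = "")
}

set.seed(42)                                  # reproducible shuffle
dat.scramble <- dat[sample(nrow(dat)), ]      # same rows, random order

sorted_out    <- collapse_results(dat)
scrambled_out <- collapse_results(dat.scramble)
print(identical(sorted_out, scrambled_out))
```

Because the rows are explicitly ordered by Assay before pasting, any permutation of the input rows should give the same strings.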
