Finding min and max values in different variables of data subsets

This is probably very simple for experienced programmers, but I'm new to R and would appreciate some help. First I'll describe my data, then ask my question. I have 110k observations of 24 variables:

      qseqid    evalue    pident    lenght    .......
    1 LL_206    3e-22     65.7      612
    2 LL_206    5e-22     75.6      485
    3 LL_206    5e-14     80.6      598
    4 LL_300    4e-22     90.5      251
    5 LL_300    4e-22     64.7      589
    6 LL_300    8e-14     89.8      125
    .
    .
    .

As you can see, my data falls into subsets defined by the qseqid variable. For each subset of qseqid, I want to find the min evalue, the max pident and the max lenght.

My results should look like this:

      qseqid    evalue    pident    lenght    .......
    1 LL_206    3e-22     65.7      612
    2 LL_300    4e-22     90.5      251
    .
    .
    .

I want the results written out as a CSV table, and the table should also include all the other variables. I tried the aggregate method, but I don't know how to tell R to first find the min evalue, then the max pident, and so on. Your help would be much appreciated.

A solution using dplyr:

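library(dplyr)
# One row per qseqid: each statistic is computed independently within the group
res <- group_by(your_data_frame, qseqid) %>%
  summarise(evalue = min(evalue), pident = max(pident),
    lenght = max(lenght))  # the column is spelled "lenght" in the data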

You can then save res using write.csv / write.csv2.
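Note that summarise computes each statistic on its own, so the three values in a result row may come from different original rows, and the other columns are dropped. If the goal is instead to keep one complete row per qseqid (lowest evalue, ties broken by highest pident, then by highest lenght), a minimal sketch could sort first and take the top row of each group. The six data rows are the ones shown in the question; the names best_rows and out_file are made up for illustration.

```r
library(dplyr)

# The six example rows from the question
your_data_frame <- data.frame(
  qseqid = c("LL_206", "LL_206", "LL_206", "LL_300", "LL_300", "LL_300"),
  evalue = c(3e-22, 5e-22, 5e-14, 4e-22, 4e-22, 8e-14),
  pident = c(65.7, 75.6, 80.6, 90.5, 64.7, 89.8),
  lenght = c(612, 485, 598, 251, 589, 125)
)

best_rows <- your_data_frame %>%
  # within each qseqid: lowest evalue first, then highest pident, then highest lenght
  arrange(qseqid, evalue, desc(pident), desc(lenght)) %>%
  group_by(qseqid) %>%
  slice(1) %>%      # keep only the first (preferred) row of each group
  ungroup()

# Hypothetical output path; row.names = FALSE avoids an extra index column
out_file <- tempfile(fileext = ".csv")
write.csv(best_rows, out_file, row.names = FALSE)
```

Because whole rows are kept, any additional columns in the data frame are carried into the CSV unchanged.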

The way I would approach it is to write a function that subsets all rows fulfilling the first requirement, then subsets that result by the second requirement, and so on.

ComplexSubset <- function(id){
  data.t <- data[data$qseqid %in% id,]  # rows for this qseqid (uses the global data frame `data`)
  data.t <- data.t[data.t$evalue %in% min(data.t$evalue),]  # keep ALL rows with the min evalue
  data.t <- data.t[data.t$pident %in% max(data.t$pident),]  # of those, keep ALL rows with the max pident
  data.t <- data.t[data.t$lenght %in% max(data.t$lenght),]  # of those, keep ALL rows with the max lenght
  # add further subsetting steps here for any remaining columns
  return(data.t)  # the surviving row(s), with all columns
}

## And then run for all unique qseqid

data.final <- data[0,]

for(id in unique(data$qseqid)){
  data.final <- rbind(data.final, ComplexSubset(id))
}

This should work. Hope it helps.
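For 110k rows, growing a data frame with rbind inside a loop can be slow. As an alternative (not the method above, but a different base R technique), one could sort the rows by priority and keep the first row of each qseqid. This is only a sketch under assumptions: it uses the six rows from the question as `data`, and unlike the loop it keeps exactly one row per qseqid even when several rows tie on all three columns.

```r
# The six example rows from the question, stored as `data` like in the answer above
data <- data.frame(
  qseqid = c("LL_206", "LL_206", "LL_206", "LL_300", "LL_300", "LL_300"),
  evalue = c(3e-22, 5e-22, 5e-14, 4e-22, 4e-22, 8e-14),
  pident = c(65.7, 75.6, 80.6, 90.5, 64.7, 89.8),
  lenght = c(612, 485, 598, 251, 589, 125)
)

# Sort by qseqid, then ascending evalue, then descending pident and lenght
ord <- order(data$qseqid, data$evalue, -data$pident, -data$lenght)
sorted <- data[ord, ]

# duplicated() is FALSE for the first occurrence of each qseqid,
# which after sorting is the preferred row of that group
data.final <- sorted[!duplicated(sorted$qseqid), ]
```

The result has the same columns as `data`, so it can be passed to write.csv directly.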

OK, this is not 100% what I want as output, but it's close...

qseqid <- c("LL_1", "LL_1", "LL_1", "LL_1", "LL_2", "LL_2", "LL_2", "LL_2")
evalue <- c(1e-34, 1e-34, 1e-34, 1e-25, 2e-85, 2e-85, 3e-85, 1e-80)
pident <- c(90.5, 90.5, 80.8, 90.5, 75.3, 85.6, 75.3, 65.2)
lenght <- c(485, 503, 897, 1052, 689, 4859, 50, 115)
title <- c("A", "B", "C", "D", "E", "F", "G", "H")
mojadata <- data.frame(qseqid, evalue, pident, lenght, title)

Data:

   qseqid evalue pident lenght title
1:   LL_1  1e-34   90.5    485     A
2:   LL_1  1e-34   90.5    503     B
3:   LL_1  1e-34   80.8    897     C
4:   LL_1  1e-25   90.5   1052     D
5:   LL_2  2e-85   75.3    689     E
6:   LL_2  2e-85   85.6   4859     F
7:   LL_2  3e-85   75.3     50     G
8:   LL_2  1e-80   65.2    115     H

Code:

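library(data.table)
mojadata <- data.table(mojadata)
# Per qseqid: the min evalue, the max pident among min-evalue rows, and the
# max lenght among rows that have both the group's max pident and its min evalue
mojadataout <- mojadata[, list(evalue = min(evalue),
       pident = max(pident[evalue == min(evalue)]),
       lenght = max(lenght[pident == max(pident) & evalue == min(evalue)])),
       by = list(qseqid)]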

Output (what I get):

   qseqid evalue pident lenght
1:   LL_1  1e-34   90.5    503
2:   LL_2  2e-85   85.6   4859 

Output (what I want):

   qseqid evalue pident lenght title
1:   LL_1  1e-34   90.5    503     B
2:   LL_2  2e-85   85.6   4859     F

Now, how do I include the variable "title" in the output? I hope my question is easier to understand this time.
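One possible way to carry "title" (and any other column) along is to stop summarising column by column and instead pick a whole row per group. The sketch below assumes the selection rule is "lowest evalue, then highest pident, then highest lenght"; it sorts the table accordingly and takes the first row of each qseqid with .SD[1]. It rebuilds the example data so that it runs on its own.

```r
library(data.table)

# Example data from the question
mojadata <- data.table(
  qseqid = c("LL_1", "LL_1", "LL_1", "LL_1", "LL_2", "LL_2", "LL_2", "LL_2"),
  evalue = c(1e-34, 1e-34, 1e-34, 1e-25, 2e-85, 2e-85, 3e-85, 1e-80),
  pident = c(90.5, 90.5, 80.8, 90.5, 75.3, 85.6, 75.3, 65.2),
  lenght = c(485, 503, 897, 1052, 689, 4859, 50, 115),
  title  = c("A", "B", "C", "D", "E", "F", "G", "H")
)

# Sort within qseqid by ascending evalue, then descending pident and lenght;
# .SD[1] then returns the first (preferred) row of each group with all columns
mojadataout <- mojadata[order(qseqid, evalue, -pident, -lenght),
                        .SD[1], by = qseqid]
```

On this example the selected rows are the ones titled B and F, matching the wanted output. If several rows tie on all three columns, only the first of them is kept.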
