Define a function for selecting data

Question

Let's start with my data.

    > dput(head(tbl_ready)) ## To make it clear I didn't put all of the row names
structure(list(Gene_name = structure(1:6, .Label = c("AT1G01050", 
"AT1G01080", "AT1G01090", "AT1G01220", "AT1G01320", "AT1G01420", 
"AT1G01470", "AT1G01800", "AT1G01910", "AT1G01920", "AT1G01960", 
"AT5G66570", "AT5G66720", "AT5G66760", "AT5G67150", "AT5G67360", 
"ATCG00120", "ATCG00160", "ATCG00170", "ATCG00190", "ATCG00380", 
"ATCG00470", "ATCG00480", "ATCG00490", "ATCG00500", "ATCG00650", 
"ATCG00660", "ATCG00670", "ATCG00750", "ATCG00770", "ATCG00780", 
"ATCG00800", "ATCG00810", "ATCG00820", "ATCG01090", "ATCG01110", 
"ATCG01120", "ATCG01240", "ATCG01300", "ATCG01310", "ATMG01190"
), class = "factor"), `10` = c(0, 0, 0, 0, 0, 0), `20` = c(0, 
0, 0, 0, 0, 0), `52.5` = c(0, 1, 0, 0, 0, 0), `81` = c(0, 0.660693687777888, 
0, 0, 0, 0), `110` = c(0, 0.521435654491704, 0, 0, 0, 1), `140.5` = c(0, 
0.437291194705566, 0, 0, 0, 1), `189` = c(0, 0.52204783488213, 
0, 0, 0, 0), `222.5` = c(0, 0.524298383907171, 0, 0, 0, 0), `278` = c(1, 
0.376865096972469, 0, 1, 0, 0), `340` = c(0, 0, 0, 0, 0, 0), 
    `397` = c(0, 0, 0, 0, 0, 0), `453.5` = c(0, 0, 0, 0, 0, 0
    ), `529` = c(0, 0, 0, 0, 0, 0), `580` = c(0, 0, 0, 0, 0, 
    0), `630.5` = c(0, 0, 0, 0, 0, 0), `683.5` = c(0, 0, 0, 0, 
    0, 0), `735.5` = c(0, 0, 0, 0, 0, 0), `784` = c(0, 0, 0.476101907006443, 
    0, 0, 0), `832` = c(0, 0, 1, 0, 0, 0), `882.5` = c(0, 0, 
    0, 0, 0, 0), `926.5` = c(0, 0, 0, 0, 1, 0), `973` = c(0, 
    0, 0, 0, 0, 0), `1108` = c(0, 0, 0, 0, 0, 0), `1200` = c(0, 
    0, 0, 0, 0, 0)), .Names = c("Gene_name", "10", "20", "52.5", 
"81", "110", "140.5", "189", "222.5", "278", "340", "397", "453.5", 
"529", "580", "630.5", "683.5", "735.5", "784", "832", "882.5", 
"926.5", "973", "1108", "1200"), row.names = c(NA, 6L), class = "data.frame")

Take a look at the column names (I only included 6 rows):

Those names give the size range. The first column covers sizes starting at 10 and ending where the second column begins, at 20. That means genes with a size between 10 and 20 belong to the first column.
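To make the column convention concrete, here is a minimal sketch of how the size ranges could be read off the column names. The vector `cols` is a hand-copied subset of the headings from the `dput()` output above, and the `ranges` data frame is only an illustrative name, not part of the original data.

```r
# Hypothetical subset of the tbl_ready column names from the dput() above.
cols <- c("Gene_name", "10", "20", "52.5", "81")

# Drop the gene-name column and convert the remaining names to numbers.
lower <- as.numeric(cols[-1])

# Each fraction runs from its own name up to the next column's name;
# the last fraction has no upper bound in this subset, so it is left as NA.
ranges <- data.frame(
  column = cols[-1],
  lower  = lower,
  upper  = c(lower[-1], NA)
)
print(ranges)
```

Under this reading, column `10` holds sizes from 10 up to 20, column `20` from 20 up to 52.5, and so on.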

I have another table that gives the size of all genes (many more than can be found in my first table):

    >dput(head(tbl_size))
    structure(list(Gene_name = structure(1:6, .Label = c("ATMG01290", "ATMG01300", "ATMG01310", "ATMG01320", "ATMG01330", 
    "ATMG01350", "ATMG01360", "ATMG01370", "ATMG01400", "ATMG01410"
    ), class = "factor"), tp = c(26L, 17L, 22L, 142L, 12L, 45L), 
        size = c(49.4255, 28.0913, 40.2872, 213.572, 24.4838, 70.4375
        )), .Names = c("locus", "tp", "size"), row.names = c(NA, 
    6L), class = "data.frame")

And now the main part. What do I want to achieve with my code?

I'm trying to find only those genes that appear in fractions (columns) whose size range is at least two times higher than the real size of the gene. I'm not sure that is clear, so let me use an example.

Let's say that we have these genes:

    Names         Size
    AT1G01080     40
    AT1G01090     30
    AT1G01220     50

Let's multiply the size by 2:

    Names        Size      
    AT1G01080     80
    AT1G01090     60
    AT1G01220     100

In the first table (tbl_ready) we can find the list of genes and the specific fractions (columns) defined by size, as I explained at the beginning of this thread. I would like to put 0 in place of any value where a gene is found in a fraction (column) whose size is not at least two times higher than the gene size.

To find the size of a gene you have to look in the second table (tbl_size).

Just to sum it up: I'm trying to determine which of those genes occur at least as a complex of 2. So only fractions with a size at least two times higher than the size of the gene are important to me.

IF SOMEONE KNOWS WHAT I AM TRYING TO DO PLEASE EDIT MY QUESTION TO MAKE IT READABLE. I FEEL LIKE MY BRAIN IS DEAD.

Answer 1

Firstly, convert the column names to their numerical values:

    # "Gene_name" cannot be converted, so frac[1] is NA (with a coercion warning)
    frac <- as.numeric(colnames(tbl_ready))

and then get, for each gene, the index of the last column whose size does not exceed twice the gene's size:

    # sapply (not lapply) so that ind is a numeric vector and ind-1 works below
    ind <- sapply(tbl_size$size, function(x) which(frac > x*2)[1]-1)

Then you can create an array index of the values that you need to set to zero:

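    # one row index per cell to be zeroed (ind-1 cells per gene, columns 2..ind)
    rowI = rep(match(tbl_size$locus, tbl_ready$Gene_name), times=ind-1)
    # SIMPLIFY=FALSE keeps a list even when all genes have the same number of columns
    colI = unlist(mapply(seq, from=2, length.out=ind-1, SIMPLIFY=FALSE))
    tbl_ready[cbind(rowI, colI)] <- 0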

You'll have to be careful if Gene_name doesn't have a 1:1 mapping with locus, and in cases where none of the columns exceed twice the gene size, as there will be NAs that need dealing with. I'm assuming you're stuck with these representations of your data; it would probably be better to store tbl_ready in a longer, narrower form than you have here (containing only three columns: name, size, and value, with the zero values omitted).
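One way those two NA cases could be guarded against is sketched below. This is not the code from the answer: it swaps the array-index approach for a plain loop, and the toy tables (genes `g1`, `g2`, `g3`, `gX` and their values) are made up purely for illustration. It also assumes that a gene for which no fraction exceeds twice its size should have its whole row zeroed.

```r
# Toy stand-ins for tbl_ready and tbl_size (all values are hypothetical).
tbl_ready <- data.frame(
  Gene_name = c("g1", "g2", "g3"),
  `10` = c(1, 1, 1), `20` = c(1, 1, 1), `52.5` = c(1, 1, 1),
  check.names = FALSE, stringsAsFactors = FALSE
)
tbl_size <- data.frame(
  locus = c("g1", "g2", "gX"),   # gX is absent from tbl_ready
  size  = c(12, 40, 5),          # 2 * 40 = 80 exceeds every fraction
  stringsAsFactors = FALSE
)

# Fraction sizes; position 1 (the Gene_name column) is set to NA explicitly.
frac <- c(NA, as.numeric(colnames(tbl_ready)[-1]))

for (k in seq_len(nrow(tbl_size))) {
  row <- match(tbl_size$locus[k], tbl_ready$Gene_name)
  if (is.na(row)) next   # gene has no row in tbl_ready: nothing to zero

  # Columns whose fraction size does not exceed twice the gene size.
  # which() drops the NA at position 1, so Gene_name is never touched.
  too_small <- which(frac <= 2 * tbl_size$size[k])
  if (length(too_small) > 0) tbl_ready[row, too_small] <- 0
}
print(tbl_ready)
```

Because each gene is handled separately, a missing match is simply skipped, and a gene that is too large for every fraction ends up with all of its values set to zero instead of producing an NA index.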

Answer 2

I'm going to change my original answer, this time using the data you've provided. The only real differences are that you've changed the column names (I'm assuming column tp in tbl_size is the thing we need to match to the column headings in tbl_ready), and that some of the rows in tbl_size don't correspond to tbl_ready.

Firstly, convert the column names to their numerical values:

    # "Gene_name" cannot be converted, so frac[1] is NA (with a coercion warning)
    frac <- as.numeric(colnames(tbl_ready))

and then get, for each gene, the index of the last column whose heading does not exceed twice the gene's tp:

    # keep only the tbl_size rows whose locus also appears in tbl_ready
    mapToReady <- tbl_size$locus %in% tbl_ready[[1]]
    ind <- sapply(tbl_size$tp[mapToReady], function(x) which(frac > x*2)[1]-1)

Then you can create an array index of the values that you need to set to zero:

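    # one row index per cell to be zeroed (ind-1 cells per gene, columns 2..ind)
    rowI = rep(match(tbl_size$locus[mapToReady], tbl_ready[[1]]), times=ind-1)
    # SIMPLIFY=FALSE keeps a list even when all genes have the same number of columns
    colI = unlist(mapply(seq, from=2, length.out=ind-1, SIMPLIFY=FALSE))
    tbl_ready[cbind(rowI, colI)] <- 0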

So, for instance, AT1G01050 is the 5th row of tbl_size (none of the previous entries have an entry in your tbl_ready), and the first row of tbl_ready. So the first 'iteration' of the sapply line hits 'tbl_size$tp[mapToReady][1]', which is the tp of AT1G01050, which is 12. 2*12 is 24, which is between 20.0 and 52.5, so for AT1G01050 we need to set the columns corresponding to '10' and '20' to zero, but not columns '52.5' onwards. This corresponds to columns 2 and 3 of row 1 of tbl_ready, which is what the cbind portion of the last three lines is doing.
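The arithmetic in that walk-through can be checked in isolation. The sketch below hard-codes the first few column headings from the question's `dput()` output and the tp value of 12 quoted above; `cols_to_zero` is just an illustrative name.

```r
# First few column headings of tbl_ready; NA stands for the Gene_name column.
frac <- c(NA, 10, 20, 52.5, 81, 110)
tp <- 12   # value quoted above for AT1G01050

# Index of the last column whose heading does not exceed 2 * tp.
ind <- which(frac > tp * 2)[1] - 1

# Columns 2..ind are the ones that get set to zero for this gene.
cols_to_zero <- seq(from = 2, length.out = ind - 1)
print(ind)
print(cols_to_zero)
```

Here `ind` comes out as 3 and `cols_to_zero` as columns 2 and 3, matching the '10' and '20' columns described above.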

2 answers

Answer 1
3 2014-05-13 15:14:22

Answer 2
1 Accepted 2014-05-20 16:25:17
