How to filter genes in a matrix based on a quantile cutoff?

Question

This is a matrix with some example data:

             S1        S2        S3
ARHGEF10L    11.1818   11.0186   11.243
HIF3A        5.2482    5.3847    4.0013
RNF17        4.1956    0         0
RNF10        11.504    11.669    12.0791
RNF11        9.5995    11.398    9.8248
RNF13        9.6257    10.8249   10.5608
GTF2IP1      11.8053   11.5487   12.1228
REM1         5.6835    3.5408    3.5582
MTVR2        0         1.4714    0
RTN4RL2      8.7486    7.9144    7.9795
C16orf13     11.8009   9.7438    8.9612
C16orf11     0         0         0
FGFR1OP2     7.679     8.7514    8.2857
TSKS         2.3036    2.8491    0.4699

I have a matrix "h" with 10,000 genes as rownames and 100 samples as columns. I need to select the top 20% most highly variable genes for clustering, but I'm not sure whether the command I used is correct.

For this filtering I used the genefilter R package:

varFilter(h, var.func=IQR, var.cutoff=0.8, filterByQuantile=TRUE)

Is this command right for getting the top 20% most highly variable genes? And can anyone please explain how this method works statistically?

Answer 1

I haven't used this package myself, but the help file of the function you're using makes the following remark:

IQR is a reasonable variance-filter choice when the dataset is split into two roughly equal and relatively homogeneous phenotype groups. If your dataset has important groups smaller than 25% of the overall sample size, or if you are interested in unusual individual-level patterns, then IQR may not be sensitive enough for your needs. In such cases, you should consider using less robust and more sensitive measures of variance (the simplest of which would be sd).

Since your data has a number of small groups, it may be wise to follow this advice and set var.func = sd in your call.

sd computes the standard deviation of each gene across samples, which should be easy to interpret.
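To make the help file's point concrete, here is a small base R sketch. The two vectors are made-up values for illustration, not taken from your matrix: one hypothetical gene has no subgroup structure, the other has 10 of 100 samples shifted upward. Because the shifted samples sit outside the middle 50% of the data, IQR does not register them, while sd does.

```r
# Hypothetical gene with no subgroup structure: 100 evenly spread values
gene_flat <- seq(7.5, 8.5, length.out = 100)

# Same gene, but the last 10 samples (10% of the cohort) are shifted up by 4
gene_small_group <- gene_flat + c(rep(0, 90), rep(4, 10))

# IQR only looks at the 25th and 75th percentiles, so the small group is invisible
iqr_flat  <- IQR(gene_flat)
iqr_group <- IQR(gene_small_group)

# sd uses every sample, so the small group inflates it
sd_flat  <- sd(gene_flat)
sd_group <- sd(gene_small_group)

print(c(iqr_flat = iqr_flat, iqr_group = iqr_group))
print(c(sd_flat = sd_flat, sd_group = sd_group))
```

With IQR as the filter statistic, both genes would rank the same; with sd, the gene carrying the small subgroup ranks clearly higher.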

However, varFilter expects its data in the form of an ExpressionSet object. The error message you got (Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'exprs' for signature '"matrix"') implies that you don't have that, but just a plain matrix instead.

I don't know how to create an ExpressionSet, but I think doing so is overly complicated anyway. So I would suggest going with the code that you posted in the comments:

vars <- apply(h, 1, sd)
h[vars > quantile(vars, 0.8), ]
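As a sanity check on what the quantile cutoff does, here is a minimal sketch on simulated data. The matrix h_demo, its dimensions, and the name top_fraction are assumptions for illustration only; substitute your own h. A cutoff at the 0.8 quantile of the per-gene standard deviations keeps the genes whose sd lies strictly above that value, which is the top 20% when there are no ties at the cutoff.

```r
set.seed(42)

# Hypothetical stand-in for h: 1,000 genes x 20 samples,
# with a different true spread for each gene
n_genes   <- 1000
n_samples <- 20
gene_spread <- seq(0.1, 2, length.out = n_genes)
h_demo <- matrix(
  rnorm(n_genes * n_samples, mean = 8, sd = rep(gene_spread, times = n_samples)),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("S", seq_len(n_samples)))
)

top_fraction <- 0.2                       # keep the top 20% most variable genes

# Per-gene variability: one value per row (swap sd for IQR to use the other statistic)
gene_sd <- apply(h_demo, 1, sd)

# Value below which 80% of the per-gene sds fall
cutoff <- quantile(gene_sd, probs = 1 - top_fraction)

# Keep genes strictly above the cutoff; drop = FALSE keeps the result a matrix
h_top <- h_demo[gene_sd > cutoff, , drop = FALSE]

print(dim(h_top))
```

Genes that are zero in every sample, like C16orf11 in the example data, have an sd of 0 and are dropped by this filter. If your matrix contains NA values, apply(h, 1, sd, na.rm = TRUE) is probably what you want.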
