How to filter genes in a matrix based on a quantile cutoff?
This is a matrix with some example data:
            S1       S2       S3
ARHGEF10L   11.1818  11.0186  11.243
HIF3A       5.2482   5.3847   4.0013
RNF17       4.1956   0        0
RNF10       11.504   11.669   12.0791
RNF11       9.5995   11.398   9.8248
RNF13       9.6257   10.8249  10.5608
GTF2IP1     11.8053  11.5487  12.1228
REM1        5.6835   3.5408   3.5582
MTVR2       0        1.4714   0
RTN4RL2     8.7486   7.9144   7.9795
C16orf13    11.8009  9.7438   8.9612
C16orf11    0        0        0
FGFR1OP2    7.679    8.7514   8.2857
TSKS        2.3036   2.8491   0.4699
I have a matrix "h" with 10,000 genes as rownames and 100 samples as columns. I need to select the top 20% most highly variable genes for clustering, but I'm not sure whether what I did is right.
For this filtering I used the genefilter R package:
varFilter(h, var.func=IQR, var.cutoff=0.8, filterByQuantile=TRUE)
Do you think the command I gave is right for getting the top 20% most highly variable genes? And can anyone please explain how this method works statistically?
I haven't used this package myself, but the help file of the function you're using makes the following remark:
IQR is a reasonable variance-filter choice when the dataset is split into two roughly equal and relatively homogeneous phenotype groups. If your dataset has important groups smaller than 25% of the overall sample size, or if you are interested in unusual individual-level patterns, then IQR may not be sensitive enough for your needs. In such cases, you should consider using less robust and more sensitive measures of variance (the simplest of which would be sd).
Since your data has a bunch of small groups, it might be wise to follow this advice and change your var.func to var.func = sd.
sd computes the standard deviation, which should be easy to understand.
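To make the help file's warning concrete, here is a minimal sketch with a made-up gene (the values are hypothetical, not from your data): 100 samples, of which a small group of 10 is strongly shifted. Because the group is smaller than 25% of the samples, the 25th and 75th percentiles coincide and IQR does not see it, while sd does.

```r
# Hypothetical gene: 90 samples at expression 5, a small group of 10 at 12.
expr <- c(rep(5, 90), rep(12, 10))

iqr_val <- IQR(expr)  # both quartiles fall inside the block of 5s
sd_val  <- sd(expr)   # the standard deviation still reflects the shifted group

iqr_val == 0          # TRUE: this gene looks "non-variable" to an IQR filter
sd_val > 0            # TRUE: sd would still rank it as variable
```

So a gene like this would be dropped by an IQR-based filter no matter where the cutoff is, which is the situation the help file describes.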
However, this function expects its data in the form of an ExpressionSet object. The error message you got (Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'exprs' for signature '"matrix"') implies that you don't have that, but just a plain matrix instead.
I don't know how to create an ExpressionSet, but I think that doing so is overly complicated anyway. So I would suggest going with the code that you posted in the comments:
vars <- apply(h, 1, sd)
h[vars > quantile(vars, 0.8), ]
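As a rough illustration of what these two lines do statistically, here is the same filter applied to the small example table from the question (values typed in by hand, with only 3 samples, so purely a sketch). For each gene it computes the standard deviation across samples, takes the 80th percentile of those values as the cutoff, and keeps the genes strictly above it, i.e. roughly the top 20%. As far as I can tell from the help file, varFilter with var.cutoff = 0.8 and filterByQuantile = TRUE is meant to do the same kind of thing with var.func in place of sd, but I have not verified that myself.

```r
# Example matrix from the question: genes as rows, samples as columns.
h <- matrix(
  c(11.1818, 11.0186, 11.2430,
     5.2482,  5.3847,  4.0013,
     4.1956,  0,       0,
    11.5040, 11.6690, 12.0791,
     9.5995, 11.3980,  9.8248,
     9.6257, 10.8249, 10.5608,
    11.8053, 11.5487, 12.1228,
     5.6835,  3.5408,  3.5582,
     0,       1.4714,  0,
     8.7486,  7.9144,  7.9795,
    11.8009,  9.7438,  8.9612,
     0,       0,       0,
     7.6790,  8.7514,  8.2857,
     2.3036,  2.8491,  0.4699),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("ARHGEF10L", "HIF3A", "RNF17", "RNF10", "RNF11", "RNF13", "GTF2IP1",
      "REM1", "MTVR2", "RTN4RL2", "C16orf13", "C16orf11", "FGFR1OP2", "TSKS"),
    c("S1", "S2", "S3")
  )
)

vars   <- apply(h, 1, sd)       # per-gene standard deviation across samples
cutoff <- quantile(vars, 0.8)   # 80th percentile of the per-gene values
top    <- h[vars > cutoff, , drop = FALSE]  # genes strictly above the cutoff

rownames(top)                   # names of the retained genes
```

The drop = FALSE is only there so that the result stays a matrix even if a single gene passes the cutoff. Swapping sd for IQR in the apply call gives the IQR-based version of the same filter.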