How to count the number of instances above a value within a given range in R?
I have a rather large dataset of SNPs across an entire genome. I am trying to generate a heatmap whose colour scales with how many SNPs have a BF (Bayes factor) value over 50 within a sliding window of x base pairs across the genome. For example, there might be 5 SNPs of interest within the first 1,000,000 base pairs, then 3 in the next 1,000,000, and so on until I reach the end of the genome; these counts would be used to generate a single-row heatmap.
Currently, my data are set out like so:
SNP BF BP
0001_107388 11.62814713 107388
0001_193069 2.333472447 193069
0001_278038 51.34452334 278038
0001_328786 5.321968927 328786
0001_523879 50.03245434 523879
0001_804477 -0.51777189 804477
0001_990357 6.235452787 990357
0001_1033297 3.08206707 1033297
0001_1167609 -2.427835577 1167609
0001_1222410 52.96447989 1222410
0001_1490205 10.98099565 1490205
0001_1689133 3.75363951 1689133
0001_1746080 3.519987207 1746080
0001_1746450 -2.86666016 1746450
0001_1777011 0.166999413 1777011
0001_2114817 3.266942137 2114817
0001_2232084 50.43561123 2232084
0001_2332903 -0.15022324 2332903
0001_2347062 -1.209000033 2347062
0001_2426273 1.230915683 2426273
where SNP = the SNP ID, BF = the Bayes factor, and BP = the position on the genome (I've fudged a couple of > 50 values in there so that the data are suitable for this example).
The issue is that I don't have a SNP for each genome position; otherwise I could simply split the windows of interest based on line count and then count how many lines in the BF column are over 50. Is there any way I can count the number of SNPs of interest within different windows of genome positions? Preferably in R, but I have no issue with using other languages like Python or Bash if it gets the job done.
Thanks!
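library(slider)
library(dplyr)
# slide_index_int() returns an integer vector (slide_index() would return a list-column).
# .before = 999999 makes each window cover BP - 999999 through BP, i.e. 1,000,000 positions.
my_data %>%
  mutate(count = slide_index_int(BF, BP, ~ sum(.x > 50), .before = 999999))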
This counts, for each SNP, how many SNPs with BF > 50 fall within the trailing 1,000,000 bp window that ends at that SNP's BP.
SNP BF BP count
1 0001_107388 11.6281471 107388 0
2 0001_193069 2.3334724 193069 0
3 0001_278038 51.3445233 278038 1
4 0001_328786 5.3219689 328786 1
5 0001_523879 50.0324543 523879 2
6 0001_804477 -0.5177719 804477 2
7 0001_990357 6.2354528 990357 2
8 0001_1033297 3.0820671 1033297 2
9 0001_1167609 -2.4278356 1167609 2
10 0001_1222410 52.9644799 1222410 3
11 0001_1490205 10.9809957 1490205 2
12 0001_1689133 3.7536395 1689133 1
13 0001_1746080 3.5199872 1746080 1
14 0001_1746450 -2.8666602 1746450 1
15 0001_1777011 0.1669994 1777011 1
16 0001_2114817 3.2669421 2114817 1
17 0001_2232084 50.4356112 2232084 1
18 0001_2332903 -0.1502232 2332903 1
19 0001_2347062 -1.2090000 2347062 1
20 0001_2426273 1.2309157 2426273 1
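The result above is a trailing window that moves with each SNP. The question also describes consecutive, non-overlapping windows (the first 1,000,000 bp, the next 1,000,000 bp, and so on) feeding a single-row heatmap. Below is a minimal base R sketch of that variant, not part of the original answer. It assumes positions start at 1 and that every window has the same width; the names window_size, threshold, bin, hits and bin_counts are made up for illustration.

```r
# Example data copied from the question (BF = Bayes factor, BP = position).
my_data <- data.frame(
  BF = c(11.62814713, 2.333472447, 51.34452334, 5.321968927, 50.03245434,
         -0.51777189, 6.235452787, 3.08206707, -2.427835577, 52.96447989,
         10.98099565, 3.75363951, 3.519987207, -2.86666016, 0.166999413,
         3.266942137, 50.43561123, -0.15022324, -1.209000033, 1.230915683),
  BP = c(107388, 193069, 278038, 328786, 523879, 804477, 990357, 1033297,
         1167609, 1222410, 1490205, 1689133, 1746080, 1746450, 1777011,
         2114817, 2232084, 2332903, 2347062, 2426273)
)

window_size <- 1e6  # assumed window width in base pairs
threshold   <- 50   # BF cutoff taken from the question

# Window index per SNP: positions 1..1e6 -> 1, 1e6+1..2e6 -> 2, and so on.
bin    <- (my_data$BP - 1) %/% window_size + 1
n_bins <- max(bin)

# Count SNPs above the cutoff in each window.
# tabulate() reports windows with no hits as 0 instead of dropping them.
hits <- tabulate(bin[my_data$BF > threshold], nbins = n_bins)

bin_counts <- data.frame(
  start = (seq_len(n_bins) - 1) * window_size + 1,
  end   = seq_len(n_bins) * window_size,
  count = hits
)
print(bin_counts)  # count column is 2, 1, 1 for this sample
```

The count column holds one value per window, so it can be passed as a one-row matrix to whichever heatmap function you prefer. Because the window index is computed from BP rather than from row number, gaps between SNPs do not shift the windows.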