Efficiently get the moving average of sparse data and filter above a threshold in Python

Question

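I'm doing some genomic analysis and I'm a bit stuck. I have some very sparse data and need to find where the moving average exceeds a certain threshold, marking each point as 1 or 0. The data is of a unique type, so I can't use the available programs for the analysis.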

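Each point represents a position (base pair) on the human genome. For each dataset there are 200,000,000 potential points. The data is essentially a list of ~12,000 index/value pairs, with all other points assumed to be zero. What I need to do is take a moving average across the whole dataset and return the regions where the average is above a threshold.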

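I'm currently reading each point from the dataset sequentially and building an array around each point I find, but this is very slow for large window sizes. Is there a more efficient way to do this, perhaps with scipy or pandas?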

Edit: Jaime's magic code below works great (but I can't upvote yet)! I really appreciate it.

Answer 1

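You can vectorize the whole thing with numpy. I built this random dataset of (approx.) 12,000 indices between 0 and 199,999,999, together with an equally long list of random floats between 0 and 1: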

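import numpy as np

# np.unique drops duplicate positions and sorts them, hence "approx." 12,000
indices = np.unique(np.random.randint(200_000_000, size=(12000,)))
values = np.random.rand(len(indices))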

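Next I build an array of indices with a total window size of 2*win+1 around each index, and a corresponding array holding each point's contribution to the moving average: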

win = 10

avg_idx = np.arange(-win, win+1) + indices[:, None]
avg_val = np.tile(values[:, None]/(2*win+1), (1, 2*win+1))
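
To make the broadcasting step concrete, here is a minimal sketch on toy data. The three positions and values are made up purely for illustration and are not from the original post. Each row of avg_idx is the window around one data point, and each row of avg_val holds that point's value divided evenly over the window.

```python
import numpy as np

# Toy data, chosen only for illustration
indices = np.array([5, 7, 20])
values = np.array([0.3, 0.6, 0.9])
win = 2

# A (2*win+1,) row of offsets broadcast against an (n, 1) column
# yields an (n, 2*win+1) grid of genome positions
avg_idx = np.arange(-win, win + 1) + indices[:, None]
avg_val = np.tile(values[:, None] / (2 * win + 1), (1, 2 * win + 1))

print(avg_idx)
# [[ 3  4  5  6  7]
#  [ 5  6  7  8  9]
#  [18 19 20 21 22]]
print(avg_idx.shape, avg_val.shape)  # (3, 5) (3, 5)
```

Note that the windows around 5 and 7 overlap at positions 5, 6 and 7, so those positions appear twice in avg_idx and their contributions still have to be summed.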

All that is left is to find the repeated indices and add together their contributions to the moving average:

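# Flatten first so the inverse mapping is 1-D on every NumPy version
# (np.bincount only accepts 1-D input)
unique_idx, inverse = np.unique(avg_idx.ravel(), return_inverse=True)
mov_avg = np.bincount(inverse, weights=avg_val.ravel())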
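
As a small worked example of this step (toy numbers, not from the original post): np.unique maps every window position to a slot in the sorted list of distinct positions, and np.bincount with weights sums everything that lands in the same slot.

```python
import numpy as np

# Two overlapping windows covering positions 3..5 and 4..6 (toy data)
avg_idx = np.array([[3, 4, 5],
                    [4, 5, 6]])
avg_val = np.array([[0.1, 0.1, 0.1],
                    [0.2, 0.2, 0.2]])

# inverse[k] is the slot in unique_idx that the k-th flattened position maps to
unique_idx, inverse = np.unique(avg_idx.ravel(), return_inverse=True)
# Sum the contributions that share a slot
mov_avg = np.bincount(inverse, weights=avg_val.ravel())

print(unique_idx)  # [3 4 5 6]
print(mov_avg)     # approximately [0.1 0.3 0.3 0.2]
```

Positions 4 and 5 are covered by both windows, so they receive 0.1 + 0.2.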

You can now get the list of indices where, for example, the moving average exceeds 0.5, as follows:

unique_idx[mov_avg > 0.5]
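
The question asks for regions rather than individual positions. One possible way to get there, sketched below, is to collapse the sorted positions that pass the threshold into runs of consecutive integers. The helper name contiguous_regions is hypothetical and not part of the original answer; it assumes its input is sorted and free of duplicates, which holds for anything selected from the output of np.unique.

```python
import numpy as np

def contiguous_regions(positions):
    """Collapse sorted, unique integer positions into inclusive (start, end) runs."""
    positions = np.asarray(positions)
    if positions.size == 0:
        return np.empty((0, 2), dtype=int)
    # A new run starts wherever the gap to the previous position exceeds 1
    breaks = np.flatnonzero(np.diff(positions) > 1)
    starts = positions[np.r_[0, breaks + 1]]
    ends = positions[np.r_[breaks, positions.size - 1]]
    return np.column_stack((starts, ends))

# Toy example: positions that passed the threshold
hits = np.array([3, 4, 5, 9, 10, 42])
regions = contiguous_regions(hits)
print(regions)
# [[ 3  5]
#  [ 9 10]
#  [42 42]]
```

With the arrays from above, the call would be contiguous_regions(unique_idx[mov_avg > 0.5]).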

As for performance, first turn the code above into a function:

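def sparse_mov_avg(idx, val, win):
    # Positions covered by the window around each data point
    avg_idx = np.arange(-win, win+1) + idx[:, None]
    # Each point contributes val/(2*win+1) to every position in its window
    avg_val = np.tile(val[:, None]/(2*win+1), (1, 2*win+1))
    # Flatten so the inverse mapping is 1-D, as np.bincount requires
    unique_idx, inverse = np.unique(avg_idx.ravel(), return_inverse=True)
    mov_avg = np.bincount(inverse, weights=avg_val.ravel())
    return unique_idx, mov_avg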

Here are some timings for several window sizes, using the test data described at the beginning:

In [2]: %timeit sparse_mov_avg(indices, values, 10)
10 loops, best of 3: 33.7 ms per loop

In [3]: %timeit sparse_mov_avg(indices, values, 100)
1 loops, best of 3: 378 ms per loop

In [4]: %timeit sparse_mov_avg(indices, values, 1000)
1 loops, best of 3: 4.33 s per loop
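
These timings grow roughly tenfold for each tenfold increase in win, because the intermediate arrays hold len(idx) * (2*win+1) elements. If that becomes a problem for very large windows, a different technique (not the one in this answer) is to treat each point as two events: its contribution switches on at idx - win and off at idx + win + 1. A cumulative sum over the sorted events then gives a piecewise-constant moving average whose cost does not depend on win. The function name sparse_mov_avg_segments and its return layout are assumptions made for this sketch, and idx is assumed to be an integer array.

```python
import numpy as np

def sparse_mov_avg_segments(idx, val, win):
    """Return (starts, ends, level): the moving average equals level[k]
    on the half-open interval [starts[k], ends[k]) and is zero elsewhere."""
    contrib = val / (2 * win + 1)
    # Each point switches its contribution on at idx - win, off at idx + win + 1
    pos = np.concatenate((idx - win, idx + win + 1))
    delta = np.concatenate((contrib, -contrib))
    # Merge events that fall on the same position, in sorted order
    breakpoints, inverse = np.unique(pos, return_inverse=True)
    step = np.bincount(inverse, weights=delta)
    # Running sum of the steps is the moving average between breakpoints;
    # floating-point round-off may leave tiny residues instead of exact zeros
    level = np.cumsum(step)
    return breakpoints[:-1], breakpoints[1:], level[:-1]

# Toy example: two points whose windows overlap at position 11
starts, ends, level = sparse_mov_avg_segments(
    np.array([10, 12]), np.array([0.5, 1.0]), win=1)
for s, e, l in zip(starts, ends, level):
    print(f"[{s}, {e}): {l:.3f}")
# [9, 11): 0.167
# [11, 12): 0.500
# [12, 14): 0.333
```

Regions above a threshold are then the segments where level exceeds it, for example starts[level > 0.5] and ends[level > 0.5]; adjacent qualifying segments would still need to be merged.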

Answer 1: 3, accepted, 2013-05-02 06:35:36
