Parallel computing for TraMineR

Original post: 2013-07-04 07:35:31 3 1 r/ parallel-processing/ traminer

I have a large dataset with more than 250,000 observations, and I would like to use the TraMineR package for my analysis. In particular, I would like to use the commands seqtree and seqdist, which work fine when I use, for example, a subsample of 10,000 observations. The most my computer can handle is around 20,000 observations.

I would like to use all the observations, and I have access to a supercomputer that should be able to do just that. However, this doesn't help much, because the process runs on a single core only. Hence my question: is it possible to apply parallel computing techniques to the commands mentioned above? Or are there other ways to speed up the process? Any help would be appreciated!

1 answer

The internal seqdist function is written in C++ and has numerous optimizations. For this reason, if you want to parallelize seqdist, you need to do it in C++. The relevant code is in the source file "distancefunctions.cpp": look at the two loops located around line 300 in the function "cstringdistance" (sorry, but all comments are in French). Unfortunately, the second important optimization is that the memory is shared between all computations. For this reason, I think that parallelization would be very complicated.

Apart from selecting a sample, you should consider the following optimizations:

1. Aggregate identical sequences (see here: Problem with big data (?) during computation of sequence distances using TraMineR).
2. If relevant, try to reduce the time granularity. Distance computation time depends heavily on sequence length (roughly quadratic, O(L^2), for sequences of length L). See https://stats.stackexchange.com/questions/43601/modifying-the-time-granularity-of-a-state-sequence. Reducing the time granularity may also increase the number of identical sequences, and hence the impact of optimization 1.
3. There is a hidden option in seqdist to use an optimized version of the optimal matching algorithm. It is still in the testing phase (that's why it is hidden), but it should replace the current algorithm in a future version. To use it, set method="OMopt" instead of method="OM". Depending on your sequences, it may reduce computation time.
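
To make the first two optimizations concrete, here is a minimal sketch in plain Python. It is not TraMineR code: the helper names (reduce_granularity, aggregate, n_pairs) and the toy sequences are hypothetical, and keeping every second state is only one possible way to coarsen the time axis. The sketch shows how identical sequences collapse into weighted unique rows, and how coarsening the granularity can merge further sequences.

```python
def reduce_granularity(seq, step):
    """Keep every `step`-th state (one possible rule; an assumption of this sketch)."""
    return tuple(seq[::step])


def aggregate(sequences):
    """Collapse identical sequences.

    Returns the unique sequences, how many original cases each one
    represents (weights), and, for every original case, the index of
    its unique sequence.
    """
    index_of = {}
    unique, weights, case_to_unique = [], [], []
    for seq in sequences:
        key = tuple(seq)
        if key not in index_of:
            index_of[key] = len(unique)
            unique.append(key)
            weights.append(0)
        i = index_of[key]
        weights[i] += 1
        case_to_unique.append(i)
    return unique, weights, case_to_unique


def n_pairs(n):
    """Number of pairwise distances needed for n sequences."""
    return n * (n - 1) // 2


# Hypothetical toy data: six cases, eight time points each.
sequences = ["AAAABBBB", "AAAABBBB", "AAABBBBB", "AABBBBCC", "AAAABBBB", "AABBBBCC"]

# Optimization 1: aggregate identical sequences.
unique, weights, case_to_unique = aggregate(sequences)

# Optimization 2: coarsen the time axis first, then aggregate again.
coarse = [reduce_granularity(s, 2) for s in sequences]
unique_coarse, weights_coarse, _ = aggregate(coarse)

print("pairs, all cases:         ", n_pairs(len(sequences)))
print("pairs, unique sequences:  ", n_pairs(len(unique)))
print("pairs, coarse and unique: ", n_pairs(len(unique_coarse)))
```

The distances then only need to be computed between the unique sequences. The weights record how many cases each unique sequence stands for, and case_to_unique maps every original case back to its row, so results can be carried back to the full dataset.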