
Parallel computing for TraMineR

I have a large dataset with more than 250,000 observations, and I would like to use the TraMineR package for my analysis. In particular, I would like to use the functions seqtree and seqdist, which work fine when I use, for example, a subsample of 10,000 observations. The limit my computer can manage is around 20,000 observations.

I would like to use all the observations, and I have access to a supercomputer that should be able to handle them. However, this doesn't help much, because the process runs on a single core only. Hence my question: is it possible to apply parallel computing techniques to the above-mentioned functions? Or are there other ways to speed up the process? Any help would be appreciated!

The internal seqdist function is written in C++ and has numerous optimizations. For this reason, if you want to parallelize seqdist, you need to do it in C++. The relevant code is in the source file "distancefunctions.cpp": look at the two loops located around line 300 in the function "cstringdistance" (sorry, but all comments are in French). Unfortunately, the second important optimization is that the memory is shared between all computations. For this reason, I think that parallelization would be very complicated.
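To make the shared-memory issue concrete, here is a minimal sketch of the general pattern. It is not TraMineR's actual code: the names `edit_distance` and `pairwise_distances` are hypothetical, and a plain Levenshtein distance stands in for whatever metric seqdist is asked to compute. The point is only that a double loop over sequence pairs which reuses one scratch buffer cannot simply be split across threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not TraMineR's code): Levenshtein distance between
// a and b, using two caller-provided scratch rows instead of allocating.
int edit_distance(const std::string& a, const std::string& b,
                  std::vector<int>& prev, std::vector<int>& curr) {
    const std::size_t m = b.size();
    assert(prev.size() > m && curr.size() > m);  // rows need m + 1 cells
    for (std::size_t j = 0; j <= m; ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= m; ++j) {
            const int subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = std::min({subst, prev[j] + 1, curr[j - 1] + 1});
        }
        std::swap(prev, curr);  // the row just filled becomes "prev"
    }
    return prev[m];
}

// Lower triangle of the pairwise distance matrix, stored row by row:
// (1,0), (2,0), (2,1), (3,0), ...
std::vector<int> pairwise_distances(const std::vector<std::string>& seqs) {
    const std::size_t n = seqs.size();
    std::size_t max_len = 0;
    for (const auto& s : seqs) max_len = std::max(max_len, s.size());

    // One scratch buffer, allocated once and reused by every pair.
    // This saves allocations, but it is also the obstacle to threading:
    // two threads writing into prev/curr at once would corrupt each other.
    std::vector<int> prev(max_len + 1), curr(max_len + 1);

    std::vector<int> out;
    out.reserve(n > 1 ? n * (n - 1) / 2 : 0);
    for (std::size_t i = 1; i < n; ++i)        // outer loop over sequences
        for (std::size_t j = 0; j < i; ++j)    // inner loop over earlier ones
            out.push_back(edit_distance(seqs[i], seqs[j], prev, curr));
    return out;
}
```

In a sketch like this, a parallel version would have to give each thread its own `prev`/`curr` rows and write results into preassigned slots of `out`. How much of that carries over to the real "cstringdistance" code is something I have not verified.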

Apart from selecting a sample, you should consider the following optimizations:
