
Spatial data and parallel computing

I'll be starting my dissertation work in the new year. I will be doing computationally intense analyses with large spatial data (running spatial regression and geographically weighted regression models with US census tract shapefile data). My current computer freezes up when I open the shapefile. It's a MacBook Pro with 4 cores, 16GB of RAM, and a 3.4 GHz processor. But I'm upgrading to an iMac with 128GB of RAM and a 3.6 GHz, 8-core processor.

However, I've been reading about parallel processing and realizing that R only uses one core. So, does that mean that the additional new cores will be useless? If so, then maybe I should save some money and not go for the extra cores? I understand I can use the parallel package (and some others), but I'm not sure that works with the spatial regression packages.

Any suggestions here would be very much appreciated.

Best, Kasey

R is capable of using multiple cores, but not in the same way as other languages like Python. When you use the parallel package, it essentially starts one R session per core assigned. Each core loads its own copy of the data and does not use shared memory. So you are making use of the multiple cores; your 8 physical cores should appear as 16 logical cores with hyperthreading. For example, if you load a list of four data frames, you can analyse them in parallel using the parallel package over four cores: each core starts an R session, loads its data, and analyses its part of the data.
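A minimal sketch of that "one worker per data frame" pattern with base R's parallel package. The four data frames and the lm() fit are placeholders for illustration only; substitute your real data and models.

```r
library(parallel)

# Four independent data frames (placeholders for real data)
dfs <- list(
  a = data.frame(x = rnorm(1e5), y = rnorm(1e5)),
  b = data.frame(x = rnorm(1e5), y = rnorm(1e5)),
  c = data.frame(x = rnorm(1e5), y = rnorm(1e5)),
  d = data.frame(x = rnorm(1e5), y = rnorm(1e5))
)

# Start one R worker session per data frame; each worker receives a copy of its chunk
cl <- makeCluster(4)

# Fit a simple model on each data frame, one per worker
fits <- parLapply(cl, dfs, function(d) lm(y ~ x, data = d))

stopCluster(cl)  # always release the workers when done
```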

The process of assigning data to separate cores has some overhead, so doing a job in serial is the most resource-efficient approach. "In serial" means that each of the four data frames is analysed in sequence (one after the other) rather than in parallel.
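A rough illustration of that overhead point: for a cheap task, plain serial lapply() can beat the parallel version, because the parallel version pays to start workers and copy data to them. The data and timings here are made up for illustration and will vary by machine.

```r
library(parallel)

small_dfs <- replicate(4, data.frame(x = rnorm(1e4), y = rnorm(1e4)),
                       simplify = FALSE)

# Serial: one data frame after the other in the current R session
system.time(serial <- lapply(small_dfs, function(d) lm(y ~ x, data = d)))

# Parallel: pays the cost of starting workers and copying the data to them
cl <- makeCluster(4)
system.time(par_fit <- parLapply(cl, small_dfs, function(d) lm(y ~ x, data = d)))
stopCluster(cl)
```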

Since it may take a long time to do what you want in serial (say, looping over thousands of independent data frames), going parallel can save you time, and you can do some scaling tests to determine the number of cores that will be most efficient (e.g., using 20 cores may save little more time than using 16, because the time gain does not scale linearly with the number of cores; see link 2 below). If your data are huge, you may run into RAM limitations, because each core will require a chunk of RAM to load and process its data (e.g., very roughly, maybe you can only use 4 cores because each one needs to load 30 GB of data and hold it in RAM).
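A simple scaling-test sketch: time the same parallel job with different numbers of cores and see where the speed-up flattens out. The workload below is a placeholder; swap in your own model-fitting function and realistic data sizes.

```r
library(parallel)

work <- replicate(32, data.frame(x = rnorm(1e5), y = rnorm(1e5)),
                  simplify = FALSE)

timings <- sapply(c(1, 2, 4, 8), function(n_cores) {
  cl <- makeCluster(n_cores)
  t <- system.time(
    parLapply(cl, work, function(d) lm(y ~ x, data = d))
  )["elapsed"]
  stopCluster(cl)
  t
})

# Elapsed seconds per core count; gains usually shrink as cores are added
data.frame(cores = c(1, 2, 4, 8), elapsed = timings)
```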

I can't speak to which spatial packages will work in parallel, but if the analyses on each core are independent of each other, then it shouldn't be an issue (I've never had problems myself, that is). If you are doing something complicated that requires data to be stitched together across cores, then maybe some packages can't handle that.
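One possible pattern when the per-region analyses really are independent: split the tracts into chunks (e.g., by state) and fit each chunk on its own core. Everything here is a stand-in: the fake tracts data frame, and fit_one_region(), which is a hypothetical placeholder for whatever spatial regression call you end up using.

```r
library(parallel)

# Fake stand-in for census-tract data, just to make the sketch runnable
tracts <- data.frame(state = rep(c("AL", "AK", "AZ", "AR"), each = 250),
                     x = rnorm(1000), y = rnorm(1000))

# Hypothetical per-region fit; swap in your real spatial regression call
fit_one_region <- function(tracts_chunk) {
  lm(y ~ x, data = tracts_chunk)  # placeholder for e.g. a spatial lag model
}

chunks <- split(tracts, tracts$state)   # independent subsets, one per state

cl <- makeCluster(length(chunks))
results <- parLapply(cl, chunks, fit_one_region)
stopCluster(cl)
```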

Additional cores won't be useless, but ultimately the best allocation of computer resources depends on the data and analyses. I wouldn't base the decision of what computer to purchase on a single project; you may do more scientific computing in the future, and you often don't have an accurate idea of the required resources in advance. Also, your university may have some high-performance computing infrastructure for heavy tasks.

This isn't a definitive answer but was too long for a comment (I can remove it if inappropriate). Hope it helps :)

See these links for more detail:

PS: In the parallel package, make sure you are using the correct function for your OS, otherwise it may just run single-threaded without you knowing.
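A sketch of that OS point: mclapply() relies on forking and so only parallelises on macOS/Linux, while a cluster made with makeCluster() plus parLapply() works on all platforms, including Windows. The workload is a made-up stand-in.

```r
library(parallel)

heavy_task <- function(i) { Sys.sleep(0.1); i^2 }  # stand-in workload

if (.Platform$OS.type == "windows") {
  # Fork-based functions are not available on Windows; use a socket cluster
  cl <- makeCluster(4)
  out <- parLapply(cl, 1:8, heavy_task)
  stopCluster(cl)
} else {
  # On macOS/Linux, mclapply() forks the current session
  out <- mclapply(1:8, heavy_task, mc.cores = 4)
}
```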

PPS: Do everything you can to increase efficiency in serial through efficient programming (e.g., using numeric matrices rather than data frames, or being careful when subsetting large data, as you will be creating copies in RAM). Do some profiling to figure out where your bottlenecks are and focus on those first. Then worry about going parallel :)
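A quick profiling sketch with base R's Rprof(). The matrix-versus-data-frame comparison is just one example of a serial efficiency win under the assumptions above, not a universal rule, and the data are invented for illustration.

```r
n <- 1e6
df  <- data.frame(a = rnorm(n), b = rnorm(n))
mat <- as.matrix(df)

# Row-wise arithmetic is usually cheaper on a numeric matrix than a data frame
system.time(rowSums(df))
system.time(rowSums(mat))

# Profile a chunk of code to find the real bottlenecks before parallelising
Rprof("profile.out")
invisible(replicate(50, rowSums(df)))
Rprof(NULL)
summaryRprof("profile.out")$by.self
```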
