
Can I allocate memory faster by using multiple threads?

If I make a loop that allocates 1 KB integer arrays, int[1024], and I want it to allocate 10000 arrays, can I make it faster by running the memory allocations from multiple threads?

I want them to be on the heap.

Let's assume that I have a multi-core processor for the job.

I already tried this, but it decreased performance. I'm just wondering: did I just write bad code, or is there something I don't know about memory allocation?

Does the answer depend on the OS? If so, please tell me how it works on different platforms.

Edit:

The integer array allocation loop was just a simplified example. Don't bother telling me how I can improve it.
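For reference, the experiment described above looks roughly like this minimal sketch; the thread count, helper names, and the choice of `unique_ptr` for ownership are illustrative assumptions, not the asker's actual code:

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <memory>
#include <thread>
#include <vector>

// Allocate `count` separate 1024-int arrays into `out` (one thread's share).
void allocate_arrays(std::size_t count,
                     std::vector<std::unique_ptr<int[]>>& out) {
    out.resize(count);
    for (std::size_t i = 0; i < count; ++i)
        out[i] = std::make_unique<int[]>(1024);
}

// Time how long it takes `threads` threads to allocate `total` arrays
// between them; returns elapsed milliseconds.
double time_parallel_alloc(std::size_t total, std::size_t threads) {
    auto start = std::chrono::steady_clock::now();
    // One destination vector per thread, so threads don't share state.
    std::vector<std::vector<std::unique_ptr<int[]>>> parts(threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t)
        workers.emplace_back(allocate_arrays, total / threads,
                             std::ref(parts[t]));
    for (auto& w : workers) w.join();
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}
```

Comparing `time_parallel_alloc(10000, 1)` against `time_parallel_alloc(10000, 4)` on a default allocator often shows the multi-threaded version no faster, and sometimes slower, which matches what the answers below explain.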

It depends on many things, but primarily:

  • the OS
  • the implementation of malloc you are using

The OS is responsible for allocating the "virtual memory" that your process has access to, and builds a translation table that maps the virtual memory back to actual memory addresses.

Now, the default implementation of malloc is generally conservative, and will simply have a giant lock around all this. This means that requests are processed serially, and the only thing that allocating from multiple threads instead of one achieves is slowing the whole thing down.

There are more clever allocation schemes, generally based upon pools, and they can be found in other malloc implementations: tcmalloc (from Google) and jemalloc (used by Facebook) are two such implementations designed for high performance in multi-threaded applications.

There is no silver bullet though: at some point the OS must perform the virtual <=> real translation, which requires some form of locking.

Your best bet is to allocate by arenas:

  • Allocate big chunks (arenas) at once
  • Split them up into arrays of the appropriate size

There is no need to parallelize the arena allocation, and you'll be better off asking for the biggest arenas you can (do bear in mind that allocation requests for too large an amount may fail); then you can parallelize the split.
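A sketch of the arena approach described above; the function name and the use of `std::vector` as the backing arena are assumptions made for illustration:

```cpp
#include <cstddef>
#include <vector>

// Carve one big arena into num_arrays slices of ints_per_array ints each.
// The arena must hold at least num_arrays * ints_per_array ints.
std::vector<int*> split_arena(int* arena, std::size_t num_arrays,
                              std::size_t ints_per_array) {
    std::vector<int*> arrays(num_arrays);
    for (std::size_t i = 0; i < num_arrays; ++i)
        arrays[i] = arena + i * ints_per_array;  // pure pointer arithmetic
    return arrays;
}
```

For the question's case, one would allocate a `std::vector<int>` of 10000 * 1024 ints (a single request to the allocator, which may throw `std::bad_alloc` if that much contiguous memory is unavailable) and then call `split_arena(arena.data(), 10000, 1024)` to get the 10000 "arrays". The split itself is trivial pointer arithmetic; the win comes from making one big allocation instead of 10000 small ones.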

tcmalloc and jemalloc may help a bit; however, they are not designed for big allocations (which are unusual), and I do not know whether it is possible to configure the size of the arenas they request.

The answer depends on the memory allocation routines, which are a combination of a C++ library layer (operator new), probably wrapped around libc's malloc(), which in turn occasionally calls an OS function such as sbrk(). The implementation and performance characteristics of all of these are unspecified, and may vary from compiler version to version, with compiler flags, different OS versions, different OSes, etc. If profiling shows it's slower, then that's the bottom line. You can try varying the number of threads, but what's probably happening is that the threads are all trying to obtain the same lock in order to modify the heap... the overheads involved with saying "ok, thread X gets the go-ahead next" and "thread X here, I'm done" are simply wasting time. Another C++ environment might end up using atomic operations to avoid locking, which might or might not prove faster... no general rule.

If you want to complete faster, consider allocating one array of 10000*1024 ints, then using different parts of it (e.g. [0]..[1023], [1024]..[2047], ...).

I think that perhaps you need to adjust your expectations of multi-threading.

The main advantage of multi-threading is that you can do tasks asynchronously, i.e. in parallel. In your case, when your main thread needs more memory it does not matter whether it is allocated by another thread: you still need to stop and wait for the allocation to be completed, so there is no parallelism here. In addition, there is the overhead of one thread signaling when it is done and the other waiting for completion, which can degrade performance. Also, if you start a thread each time you need an allocation, that is a huge overhead. If not, you need some mechanism to pass allocation requests and responses between threads, a kind of task queue, which again is overhead without gain.

Another approach could be to have an allocating thread that runs ahead and pre-allocates the memory that you will need. This can give you a real gain, but if you are doing pre-allocation, you might as well do it in the main thread, which is simpler. E.g. allocate 10M in one shot (or 10 times 1M, or as much contiguous memory as you can get) and have an array of 10,000 pointers pointing into it at 1024-int offsets, representing your arrays. If you don't need to deallocate them independently of one another, this seems much simpler and could be even more efficient than using multi-threading.

As for glibc, it has arenas (see here), with a lock per arena.

You may also consider tcmalloc by Google (which stands for Thread-Caching malloc), which shows a 30% performance boost for threaded applications. We use it in our project. In debug mode it can even discover some incorrect usage of memory (e.g. new/free mismatches).
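For what it's worth, a common way to try tcmalloc is to preload it without recompiling; the library path below is an assumption and varies by distribution:

```shell
# Preload tcmalloc so its malloc/free replace the libc ones at runtime
# (the exact .so path depends on your distribution and gperftools version).
LD_PRELOAD=/usr/lib/libtcmalloc.so ./your_app

# Or link it in at build time instead:
g++ -O2 main.cpp -ltcmalloc -o your_app
```

Either way, no source changes are needed, which makes it cheap to measure whether the allocator is actually your bottleneck.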

As far as I know, all OSes have an implicit mutex lock inside the dynamic allocation routines (malloc...). If you think about it for a moment: if this were not locked, you could run into terrible problems.

You could use the Threading Building Blocks multithreading API, http://threadingbuildingblocks.org/, which has a multithreading-friendly scalable allocator.

But I think a better idea is to allocate the whole memory once (which should be quite fast) and split it up on your own. I think the TBB allocator does something similar.

Do something like

new int[1024*10000] and then assign the parts of 1024 ints to your pointer array or whatever you use.

Do you understand?

Because the heap is shared per process, the heap will be locked for each allocation, so it can only be accessed serially by each thread. This could explain the decrease in performance when you allocate from multiple threads as you are doing.

The answer depends on the operating system and runtime used, but in most cases, you cannot.

Generally, you will have two versions of the runtime: a multi-threaded version and a single-threaded version.

The single-threaded version is not thread-safe. Allocations made by two threads at the same time can blow your application up.

The multi-threaded version is thread-safe. However, as far as allocations go, on most common implementations this just means that calls to malloc are wrapped in a mutex. Only one thread can ever be in the malloc function at any given time, so attempting to speed up allocations with multiple threads will just result in a lock convoy.
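Conceptually, such a "big lock" allocator looks like the toy wrapper below. This is purely illustrative (no real libc is written this way), but it shows why adding threads adds contention rather than throughput:

```cpp
#include <cstdlib>
#include <mutex>

// Toy model of a conservatively locked heap: every request, from every
// thread, funnels through one mutex, so allocations are served one at a time.
class LockedAllocator {
public:
    void* allocate(std::size_t bytes) {
        std::lock_guard<std::mutex> guard(lock_);  // threads queue up here
        return std::malloc(bytes);
    }
    void deallocate(void* p) {
        std::lock_guard<std::mutex> guard(lock_);
        std::free(p);
    }
private:
    std::mutex lock_;
};
```

With this design, ten threads calling `allocate` concurrently still perform ten allocations strictly one after another, plus the cost of handing the lock around: the lock convoy described above.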

It may be possible that there are operating systems that can safely handle parallel allocations within the same process, using minimal locking, which would allow you to decrease the time spent allocating. Unfortunately, I don't know of any.

If the arrays belong together and will only be freed as a whole, you can just allocate one array of 10000*1024 ints, and then make your individual arrays point into it. Just remember that you cannot delete the small arrays, only the whole.

int *all_arrays = new int[1024 * 10000];
int *small_array123 = all_arrays + 1024 * 123;

Like this, you have small arrays when you replace the 123 with a number between 0 and 9999.
