Efficiently/concurrently insert into unordered_map<>

I need to gather some statistics for my project with the following algorithm (in Python):

from collections import defaultdict
from copy import copy

# L and f() are defined elsewhere in the project
stats = defaultdict(list)
for s in L:
    current = []
    for q in L:
        stats[s, q] = copy(current)
        current = f(current, s, q)

Because the list L is large, f() and copying current take some time, and the project's main language is C++, I decided to use C++ and its multithreading capabilities to implement my algorithm.

I moved that part:

        stats[s, q] = copy(current)
        current = f(current, s, q)

to a separate std::async, and locked a std::mutex while inserting into stats, but that made things slower. I tried to use tbb::concurrent_unordered_map, but that made things worse.
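
For reference, a minimal sketch of what that mutex-guarded insert looks like (the packed key, the container choice, and the record() helper are illustrative, not taken from the benchmark):

#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

using List = std::vector<int>;

std::unordered_map<std::uint64_t, List> stats;
std::mutex stats_mutex;

// Pack (s, q) into a single 64-bit key so a plain unordered_map can be used.
inline std::uint64_t key(int s, int q)
{
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(s)) << 32)
         | static_cast<std::uint32_t>(q);
}

void record(int s, int q, const List& current)
{
    std::lock_guard<std::mutex> lock(stats_mutex);
    stats[key(s, q)] = current;   // every task copies 'current' while holding the same lock
}

Every writer serializes on the single mutex, and the copy of current happens inside the critical section, so contention grows with the number of tasks.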

I wrote a benchmark that reproduces this behavior: https://gist.github.com/myaut/94ee59d9752f524d3da8

Results for 2 x Xeon E5-2420 with Debian 7 for 800 entries in L:

single-threaded                       8157ms
async                mutex            10145ms
async with chunks    mutex            9618ms
threads              mutex            9378ms
async                tbb              10173ms
async with chunks    tbb              9560ms
threads              tbb              9418ms

I do not understand why TBB is slowest (it seems tbb::concurrent_unordered_map allocates larger amounts of memory, but for what?). Are there any other options that can help me?

EDIT: I've updated my benchmark with the suggested approaches (and reduced N to 800). It seems the problem is somewhere else...

  • chunks - thanks to @Dave - each async now handles a bundle of 20 sequential elements of the list (see the sketch after this list)
  • threads - a kind of thread pool, as @Cameron suggests - I create 20 threads and each of them takes every 20th element of the initial list.
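
A rough sketch of the chunked variant (run_in_chunks() and process_range() are hypothetical names, not the benchmark's):

#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

constexpr std::size_t CHUNK = 20;   // each task handles 20 consecutive elements of L

template <class Work>
void run_in_chunks(const std::vector<int>& L, Work process_range)
{
    std::vector<std::future<void>> futures;
    for (std::size_t begin = 0; begin < L.size(); begin += CHUNK) {
        std::size_t end = std::min(begin + CHUNK, L.size());
        futures.push_back(std::async(std::launch::async,
                                     [&, begin, end] { process_range(begin, end); }));
    }
    for (auto& fut : futures)
        fut.get();                  // keep and join the futures so the tasks actually overlap
}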

EDIT 2: I found one of the issues -- the app consumes a large amount of memory, so the Xen hypervisor became a bottleneck -- after rebooting in native mode, the multithreaded modes are only a bit slower than the single-threaded one.

EDIT 3: It seems the issue with multithreading is the huge number of allocations made when copying the list:

mprotect()
_int_malloc+0xcba/0x13f0
__libc_malloc+0x70/0x260
operator new(unsigned long)+0x1d/0x90
__gnu_cxx::new_allocator<int>::allocate(unsigned long, void const*)+0x40/0x42
std::_Vector_base<int, std::allocator<int> >::_M_allocate(unsigned long)+0x2f/0x38
std::_Vector_base<int, std::allocator<int> >::_M_create_storage(unsigned long)+0x23/0x58
std::_Vector_base<int, std::allocator<int> >::_Vector_base(unsigned long, std::allocator<int> const&)+0x3b/0x5e
std::vector<int, std::allocator<int> >::vector(std::vector<int, std::allocator<int> > const&)+0x55/0xf0
void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1}::operator()(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int) const+0x5f/0x1dc
_ZNSt12_Bind_simpleIFZ16threaded_processI31concurrent_map_of_list_of_listsEvRT_RKSt6vectorIiSaIiEEEUlN9__gnu_cxx17__normal_iteratorIPKiS6_EESD_iE_SD_SD_iEE9_M_invokeIJLm0ELm1ELm2EEEEvSt12_Index_tupleIJXspT_EEE+0x7c/0x87
std::_Bind_simple<void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1} (__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)>::operator()()+0x1b/0x28
std::thread::_Impl<std::_Bind_simple<void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1} (__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)> >::_M_run()+0x1c/0x1e
std::error_code::default_error_condition() const+0x40/0xc0
start_thread+0xd0/0x300
clone+0x6d/0x90

The thing is, when heap space is exhausted, libc calls grow_heap(), which usually adds only one page, but then calls mprotect(), which calls validate_mm() in the kernel. validate_mm() seems to lock the entire VMA using a semaphore. I replaced the default list allocator with tbb::scalable_allocator, and it rocks! Now TBB is 2 times faster than the uni-processor approach.
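
For anyone trying the same fix: assuming the per-(s, q) snapshots are std::vector<int> as in the benchmark, swapping in the scalable allocator is a one-line type change:

#include <vector>
#include <tbb/scalable_allocator.h>

// tbb::scalable_allocator serves allocations from per-thread memory pools, so the many
// small vector copies no longer funnel through the glibc heap and its mprotect()-based
// growth path.
using List = std::vector<int, tbb::scalable_allocator<int>>;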

Thanks for your help; I will use @Dave's approach with chunks of work in std::async.

If f(current, s, q) and copying current are trivially cheap, it's going to be hard to make going wide with multithreading worth the overhead. However, I think I would use a lock-free hash/unordered map (tbb::concurrent_hash_map? I don't know tbb) and launch the entire inner for loop with std::async. The idea is to launch a decent-sized chunk of work with std::async; if it's too tiny and you launch a million trivial tasks, the overhead of using std::async will eclipse the work the task has to do!
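
A sketch of that suggestion (the packed key, the placeholder f(), and the launch_row() name are mine, not the answer's): one std::async task runs the entire inner loop for a given s, and tbb::concurrent_hash_map handles the concurrent inserts without an external lock.

#include <cstdint>
#include <future>
#include <vector>
#include <tbb/concurrent_hash_map.h>

using List     = std::vector<int>;
using StatsMap = tbb::concurrent_hash_map<std::uint64_t, List>;

// Placeholder for the real f(current, s, q).
List f(List current, int s, int q) { current.push_back(s + q); return current; }

std::future<void> launch_row(const std::vector<int>& L, int s, StatsMap& stats)
{
    return std::async(std::launch::async, [&L, s, &stats] {
        List current;
        for (int q : L) {
            StatsMap::accessor acc;            // write access locks only this element
            stats.insert(acc, (static_cast<std::uint64_t>(static_cast<std::uint32_t>(s)) << 32)
                              | static_cast<std::uint32_t>(q));
            acc->second = current;             // store the snapshot
            current = f(std::move(current), s, q);
        }
    });
}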

Also note that when you use std::async you need to save off the returned future somewhere, or it ends up blocking until the task is done in the future's destructor, buying you multithreading overhead and no parallel processing at all. You may be running into that now. It's very obnoxious, and I wish it didn't work that way.
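
The pitfall in code form (the task body is a placeholder): discarding the temporary future makes every iteration block, while collecting the futures lets the tasks actually run in parallel.

#include <future>
#include <vector>

void launch_all(int n)
{
    std::vector<std::future<void>> futures;
    for (int i = 0; i < n; ++i) {
        // std::async(std::launch::async, task);   // WRONG: the discarded future's
        //                                         // destructor blocks right here
        futures.push_back(std::async(std::launch::async, [i] {
            (void)i;                               // task body would go here
        }));
    }
    for (auto& fut : futures)
        fut.get();                                 // wait for everything at the end
}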
