Efficient/simultaneous insertion into unordered_map<>

Question

I need to gather some statistics for my project with the following algorithm (in Python):

from collections import defaultdict
from copy import copy

stats = defaultdict(list)
for s in L:
    current = []
    for q in L:
        stats[s, q] = copy(current)
        current = f(current, s, q)

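Because the list L is large, f() and copying current take some time, and the project's main language is C++, I decided to implement the algorithm in C++ and use its multithreading facilities.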
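For reference, a minimal single-threaded C++ sketch of the Python loop could look like the following. The names List, Key, PairHash, Stats and collect_stats are made up for illustration, and f() here is only a hypothetical stand-in (it appends s * q), since the real f() is not shown in the question.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

using List = std::vector<int>;
using Key  = std::pair<int, int>;

// std::pair has no std::hash specialization, so unordered_map needs one.
struct PairHash {
    std::size_t operator()(const Key& k) const noexcept {
        const std::size_t h1 = std::hash<int>{}(k.first);
        const std::size_t h2 = std::hash<int>{}(k.second);
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};

using Stats = std::unordered_map<Key, List, PairHash>;

// Hypothetical stand-in for f(): appends s * q to the running list.
List f(List current, int s, int q) {
    current.push_back(s * q);
    return current;
}

// Single-threaded baseline: for every s, walk L and store a snapshot of
// `current` before updating it.
Stats collect_stats(const List& L) {
    Stats stats;
    for (int s : L) {
        List current;
        for (int q : L) {
            stats[Key{s, q}] = current;              // copy(current)
            current = f(std::move(current), s, q);
        }
    }
    return stats;
}
```

Note that each iteration of the inner loop depends on the previous one through current, so only the outer loop over s can be split across threads.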

I moved this part:

         stats[s, q] = copy(current)
         current = f(current, s, q)

into a separate std::async and locked a std::mutex while inserting into stats, but that made things slower. I tried using tbb::concurrent_ordered_map, but that made things even worse.

I wrote a benchmark that reproduces this behavior: https://gist.github.com/myaut/94ee59d9752f524d3da8

Results on 2 x Xeon E5-2420 running Debian 7, with 800 entries:

single-threaded                       8157ms
async                mutex            10145ms
async with chunks    mutex            9618ms
threads              mutex            9378ms
async                tbb              10173ms
async with chunks    tbb              9560ms
threads              tbb              9418ms

I don't understand why TBB is so slow (it seems that tbb::concurrent_ordered_map allocates more memory, but what for?). Are there any other options that could help me?

EDIT: I have updated my benchmark with the suggested approaches (and reduced N to 800). It seems the problem is somewhere else...

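chunks - thanks to @Dave - each async now handles a bundle of 20 consecutive elements of the list
threads - a kind of thread pool, as @Cameron suggested - I create 20 threads, and each of them takes every 20th element of the initial list.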
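The "threads" variant can be sketched roughly as below. This is not the code from the gist, only an illustration under assumptions: f() is again a hypothetical stand-in, collect_stats_strided is an invented name, and the shared map is a plain std::unordered_map guarded by a std::mutex.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using List = std::vector<int>;
using Key  = std::pair<int, int>;

struct PairHash {
    std::size_t operator()(const Key& k) const noexcept {
        const std::size_t h1 = std::hash<int>{}(k.first);
        const std::size_t h2 = std::hash<int>{}(k.second);
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};

using Stats = std::unordered_map<Key, List, PairHash>;

// Hypothetical stand-in for f(): appends s * q to the running list.
List f(List current, int s, int q) {
    current.push_back(s * q);
    return current;
}

// Thread t handles elements t, t + num_threads, t + 2 * num_threads, ...
Stats collect_stats_strided(const List& L, std::size_t num_threads) {
    if (num_threads == 0) num_threads = 1;
    Stats stats;
    std::mutex stats_mutex;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < L.size(); i += num_threads) {
                const int s = L[i];
                List current;
                for (int q : L) {
                    List snapshot = current;         // copy outside the lock
                    {
                        std::lock_guard<std::mutex> lock(stats_mutex);
                        stats[Key{s, q}] = std::move(snapshot);
                    }
                    current = f(std::move(current), s, q);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return stats;
}
```

The copy is made before the lock is taken so that the critical section contains only the map insertion.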

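EDIT2: I found one of the issues - the app consumes a large amount of memory, so the Xen hypervisor became a bottleneck. After rebooting in native mode, the multithreaded modes are now only a bit slower than the single-threaded one.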

EDIT3: It seems that the issue with multithreading is the large number of allocations when copying list:

mprotect()
_int_malloc+0xcba/0x13f0
__libc_malloc+0x70/0x260
operator new(unsigned long)+0x1d/0x90
__gnu_cxx::new_allocator<int>::allocate(unsigned long, void const*)+0x40/0x42
std::_Vector_base<int, std::allocator<int> >::_M_allocate(unsigned long)+0x2f/0x38
std::_Vector_base<int, std::allocator<int> >::_M_create_storage(unsigned long)+0x23/0x58
std::_Vector_base<int, std::allocator<int> >::_Vector_base(unsigned long, std::allocator<int> const&)+0x3b/0x5e
std::vector<int, std::allocator<int> >::vector(std::vector<int, std::allocator<int> > const&)+0x55/0xf0
void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1}::operator()(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int) const+0x5f/0x1dc
_ZNSt12_Bind_simpleIFZ16threaded_processI31concurrent_map_of_list_of_listsEvRT_RKSt6vectorIiSaIiEEEUlN9__gnu_cxx17__normal_iteratorIPKiS6_EESD_iE_SD_SD_iEE9_M_invokeIJLm0ELm1ELm2EEEEvSt12_Index_tupleIJXspT_EEE+0x7c/0x87
std::_Bind_simple<void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1} (__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)>::operator()()+0x1b/0x28
std::thread::_Impl<std::_Bind_simple<void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1} (__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)> >::_M_run()+0x1c/0x1e
std::error_code::default_error_condition() const+0x40/0xc0
start_thread+0xd0/0x300
clone+0x6d/0x90

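The thing is, when the heap space is exhausted, libc calls grow_heap(), which usually adds only one page, but then calls mprotect(), which calls validate_mm() in the kernel. validate_mm() seems to lock the entire VMA using a semaphore. I replaced the default allocator of list with tbb::scalable_allocator, and it rocks! Now tbb is 2 times faster than the single-processor approach.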
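One way to make such an allocator swap easy to try is to keep the allocator as a template parameter of the list type. The sketch below is an illustration, not the code from the gist: ListWith and snapshot are invented names, and it compiles with std::allocator so that it does not depend on TBB. The commented-out lines show where tbb::scalable_allocator would go, assuming the TBB headers and the tbbmalloc library are available.

```cpp
#include <memory>
#include <vector>

// The allocator is a template parameter, so it can be changed in one place
// without touching the algorithm.
template <class Alloc = std::allocator<int>>
using ListWith = std::vector<int, Alloc>;

// Default build: the standard allocator.
using List = ListWith<>;

// With TBB installed (assumption), the alias above could be replaced by:
//   #include <tbb/scalable_allocator.h>
//   using List = ListWith<tbb::scalable_allocator<int>>;

// The copy that shows up in the stack trace: the vector copy constructor
// allocates new storage through whatever allocator the type uses.
template <class Alloc>
std::vector<int, Alloc> snapshot(const std::vector<int, Alloc>& current) {
    return std::vector<int, Alloc>(current);
}
```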

Thanks for your help. I will use @Dave's approach, with std::async working on chunks.

Answer 1

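If f(current, s, q) and copying current are trivially cheap, it will be hard to justify the overhead of multithreading. However, I think I would use a lock-free hash/unordered map (tbb::concurrent_hash_map? I don't know tbb) and launch the entire inner for loop with std::async. The idea is to launch a decent-sized chunk of work with std::async; if the chunk is too small and you launch a million trivial tasks, the overhead of std::async will eclipse the work the tasks have to do!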
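A rough sketch of the "one std::async per decent-sized chunk" idea follows. To keep it free of TBB, it does not use a concurrent map; instead each task fills a private map that is merged afterwards, which is a substitution and not what the answer literally proposes. f() is a hypothetical stand-in and collect_stats_chunked is an invented name.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <future>
#include <unordered_map>
#include <utility>
#include <vector>

using List = std::vector<int>;
using Key  = std::pair<int, int>;

struct PairHash {
    std::size_t operator()(const Key& k) const noexcept {
        const std::size_t h1 = std::hash<int>{}(k.first);
        const std::size_t h2 = std::hash<int>{}(k.second);
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};

using Stats = std::unordered_map<Key, List, PairHash>;

// Hypothetical stand-in for f(): appends s * q to the running list.
List f(List current, int s, int q) {
    current.push_back(s * q);
    return current;
}

// Each task runs the whole inner loop for chunk_size consecutive values of s.
Stats collect_stats_chunked(const List& L, std::size_t chunk_size) {
    if (chunk_size == 0) chunk_size = 1;
    std::vector<std::future<Stats>> futures;   // stored, so the tasks overlap
    for (std::size_t begin = 0; begin < L.size(); begin += chunk_size) {
        const std::size_t end = std::min(begin + chunk_size, L.size());
        futures.push_back(std::async(std::launch::async, [&L, begin, end] {
            Stats local;                       // private to this task, no lock
            for (std::size_t i = begin; i < end; ++i) {
                const int s = L[i];
                List current;
                for (int q : L) {
                    local[Key{s, q}] = current;
                    current = f(std::move(current), s, q);
                }
            }
            return local;
        }));
    }
    Stats stats;
    for (auto& fut : futures) {
        Stats part = fut.get();                // wait for the chunk
        for (auto& kv : part) stats[kv.first] = std::move(kv.second);
    }
    return stats;
}
```

The chunk size is the tuning knob: larger chunks mean fewer tasks and less scheduling overhead, at the cost of coarser load balancing.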

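Also be aware that when you use std::async you need to save the returned future somewhere; otherwise the future's destructor blocks until the task is done, which buys you the multithreading overhead with no parallel processing at all. You may be running into that right now. It's very obnoxious and I wish it didn't work that way.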
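The blocking behavior can be seen in a small sketch. The function names are invented for illustration. It relies on the standard rule that the future returned by std::async with std::launch::async waits for the task in its destructor.

```cpp
#include <atomic>
#include <chrono>
#include <future>
#include <thread>

// The returned future is a temporary, so its destructor runs at the end of
// the statement and waits for the task. By the time the next line executes,
// the task has already finished: the call was effectively synchronous.
bool discarded_future_blocks() {
    std::atomic<bool> done{false};
    (void)std::async(std::launch::async, [&done] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        done = true;
    });
    return done.load();
}

// Keeping the future lets the caller do other work while the task runs and
// collect the result only when it is needed.
int kept_future_result() {
    std::future<int> fut = std::async(std::launch::async, [] { return 42; });
    // ... other work could happen here, concurrently with the task ...
    return fut.get();
}
```

When many tasks are launched in a loop, the usual pattern is to push every future into a std::vector and call get() on each one after the loop.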

Solution 1
3 Accepted 2015-03-12 17:35:18
