Efficient/concurrent insertion into unordered_map<>

Question

I need to gather some statistics for my project using the following algorithm (in Python):

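from collections import defaultdict
from copy import copy

stats = defaultdict(list)
for s in L:
     current = []
     for q in L:
         # store a snapshot of current before f() updates it
         stats[s, q] = copy(current)
         current = f(current, s, q)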

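Because the list L is large, f() and copying current take some time, and the project's main language is C++, I decided to implement the algorithm in C++ and use its multithreading capabilities.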
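For reference, a minimal single-threaded C++ sketch of the Python loop above could look like the following. The names List, Stats, make_key and collect_stats are hypothetical, the body of f() is a placeholder (the question does not show the real one), and packing (s, q) into one 64-bit key is just one way to avoid writing a hash for std::pair.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using List = std::vector<int>;
using Stats = std::unordered_map<std::uint64_t, List>;

// Pack the (s, q) pair into a single 64-bit key (assumed key layout).
inline std::uint64_t make_key(int s, int q) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(s)) << 32) |
           static_cast<std::uint32_t>(q);
}

// Placeholder for the real f(): here it just appends s + q.
List f(List current, int s, int q) {
    current.push_back(s + q);
    return current;
}

// Sequential version: for every s, walk L and store a snapshot per (s, q).
Stats collect_stats(const List& L) {
    Stats stats;
    for (int s : L) {
        List current;
        for (int q : L) {
            stats[make_key(s, q)] = current;         // copy, like copy(current)
            current = f(std::move(current), s, q);   // then update current
        }
    }
    return stats;
}
```

Note that the inner loop carries current from one q to the next, so only the outer loop over s is independent work.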

I moved this part:

         stats[s, q] = copy(current)
         current = f(current, s, q)

into a separate std::async, and locked a std::mutex while inserting into stats, but that made things slower. I tried using tbb::concurrent_ordered_map, but that made things worse.
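The std::async plus std::mutex variant might be sketched roughly as below. This is not the code from the benchmark: it assumes one task per outer element s, a placeholder f(), and the hypothetical helpers make_key and collect_stats_async. It launches one thread per element, so it is only meant for small inputs.

```cpp
#include <cstdint>
#include <future>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

using List = std::vector<int>;
using Stats = std::unordered_map<std::uint64_t, List>;

// Pack the (s, q) pair into a single 64-bit key (assumed key layout).
inline std::uint64_t make_key(int s, int q) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(s)) << 32) |
           static_cast<std::uint32_t>(q);
}

// Placeholder for the real f(): here it just appends s + q.
List f(List current, int s, int q) {
    current.push_back(s + q);
    return current;
}

// One std::async task per outer element; insertions are guarded by a mutex.
Stats collect_stats_async(const List& L) {
    Stats stats;
    std::mutex stats_mutex;
    std::vector<std::future<void>> pending;  // keep the futures alive
    pending.reserve(L.size());
    for (int s : L) {
        pending.push_back(std::async(std::launch::async, [&, s] {
            List current;
            for (int q : L) {
                List snapshot = current;  // copy outside the lock
                {
                    std::lock_guard<std::mutex> lock(stats_mutex);
                    stats[make_key(s, q)] = std::move(snapshot);
                }
                current = f(std::move(current), s, q);
            }
        }));
    }
    for (auto& task : pending) task.get();  // wait and propagate exceptions
    return stats;
}
```

Every insertion still goes through the same mutex, so the tasks contend on it whenever f() is cheap.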

I wrote a benchmark that reproduces this behavior: https://gist.github.com/myaut/94ee59d9752f524d3da8

Results on 2 x Xeon E5-2420 with Debian 7, for 800 entries in L:

single-threaded                       8157ms
async                mutex            10145ms
async with chunks    mutex            9618ms
threads              mutex            9378ms
async                tbb              10173ms
async with chunks    tbb              9560ms
threads              tbb              9418ms

I don't understand why TBB is so slow (it seems tbb::concurrent_ordered_map allocates more memory, but for what?). Are there any other options that could help me?

EDIT: I have updated my benchmark with the suggested approaches (and reduced N to 800). It seems the problem is somewhere else...

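chunks - thanks to @Dave - each async now handles a bundle of 20 consecutive elements of the list
threads - a kind of thread pool, as @Cameron suggested - I create 20 threads, and each of them takes every 20th element of the initial list.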
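The two partitioning schemes described above could be sketched as follows. This is an illustration under assumptions, not the benchmark code: process_row, collect_chunked and collect_strided are hypothetical names, f() is a placeholder, and both variants still insert into one mutex-protected map.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <future>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using List = std::vector<int>;
using Stats = std::unordered_map<std::uint64_t, List>;

inline std::uint64_t make_key(int s, int q) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(s)) << 32) |
           static_cast<std::uint32_t>(q);
}

// Placeholder for the real f(): here it just appends s + q.
List f(List current, int s, int q) {
    current.push_back(s + q);
    return current;
}

// One outer iteration: the inner loop for a fixed s.
void process_row(int s, const List& L, Stats& stats, std::mutex& m) {
    List current;
    for (int q : L) {
        List snapshot = current;
        {
            std::lock_guard<std::mutex> lock(m);
            stats[make_key(s, q)] = std::move(snapshot);
        }
        current = f(std::move(current), s, q);
    }
}

// "chunks": each async task handles `chunk` consecutive elements of L.
Stats collect_chunked(const List& L, std::size_t chunk = 20) {
    assert(chunk > 0);
    Stats stats;
    std::mutex m;
    std::vector<std::future<void>> pending;
    for (std::size_t begin = 0; begin < L.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, L.size());
        pending.push_back(std::async(std::launch::async, [&, begin, end] {
            for (std::size_t i = begin; i < end; ++i) process_row(L[i], L, stats, m);
        }));
    }
    for (auto& task : pending) task.get();
    return stats;
}

// "threads": worker t takes elements t, t + n_threads, t + 2 * n_threads, ...
Stats collect_strided(const List& L, std::size_t n_threads = 20) {
    assert(n_threads > 0);
    Stats stats;
    std::mutex m;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < L.size(); i += n_threads) process_row(L[i], L, stats, m);
        });
    }
    for (auto& w : workers) w.join();
    return stats;
}
```

Both functions should produce the same map as the sequential loop; they differ only in how the outer iterations are distributed over tasks.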

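EDIT2: I found one of the issues - the app consumes a large amount of memory, so the Xen Hypervisor became a bottleneck - I rebooted in native mode, and now the multithreaded modes are only a bit slower than single-threaded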

EDIT3: It seems the issue with multithreading is the large number of allocations when copying the list:

mprotect()
_int_malloc+0xcba/0x13f0
__libc_malloc+0x70/0x260
operator new(unsigned long)+0x1d/0x90
__gnu_cxx::new_allocator<int>::allocate(unsigned long, void const*)+0x40/0x42
std::_Vector_base<int, std::allocator<int> >::_M_allocate(unsigned long)+0x2f/0x38
std::_Vector_base<int, std::allocator<int> >::_M_create_storage(unsigned long)+0x23/0x58
std::_Vector_base<int, std::allocator<int> >::_Vector_base(unsigned long, std::allocator<int> const&)+0x3b/0x5e
std::vector<int, std::allocator<int> >::vector(std::vector<int, std::allocator<int> > const&)+0x55/0xf0
void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1}::operator()(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int) const+0x5f/0x1dc
_ZNSt12_Bind_simpleIFZ16threaded_processI31concurrent_map_of_list_of_listsEvRT_RKSt6vectorIiSaIiEEEUlN9__gnu_cxx17__normal_iteratorIPKiS6_EESD_iE_SD_SD_iEE9_M_invokeIJLm0ELm1ELm2EEEEvSt12_Index_tupleIJXspT_EEE+0x7c/0x87
std::_Bind_simple<void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1} (__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)>::operator()()+0x1b/0x28
std::thread::_Impl<std::_Bind_simple<void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1} (__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)> >::_M_run()+0x1c/0x1e
std::error_code::default_error_condition() const+0x40/0xc0
start_thread+0xd0/0x300
clone+0x6d/0x90

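The thing is that when heap space is exhausted, libc calls grow_heap(), which usually adds only one page, but then it calls mprotect(), which calls validate_mm() in the kernel. validate_mm() seems to lock the entire VMA using a semaphore. I replaced the default list allocator with tbb::scalable_allocator, and it rocks! Now tbb is 2 times faster than the single-processor approach.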
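To make the allocation pressure visible, here is a small sketch with a hypothetical CountingAllocator that counts how often copying a list reaches the allocator. It is not the benchmark's code. The allocator is a template parameter of std::vector, so, assuming TBB is installed, the same alias could be instantiated as ListWith<tbb::scalable_allocator<int>> (from tbb/scalable_allocator.h); that variant is not compiled here.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <vector>

// Number of allocate() calls made through CountingAllocator (illustration only).
std::atomic<std::size_t> g_allocations{0};

// Minimal allocator that counts calls and forwards to std::allocator.
template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++g_allocations;
        return std::allocator<T>().allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>().deallocate(p, n);
    }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return true;
}
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return false;
}

// The list type with a pluggable allocator.
template <class Alloc>
using ListWith = std::vector<int, Alloc>;

// Copies a list of `length` ints `copies` times and returns how many
// allocator calls those copies caused.
std::size_t allocations_for_copies(std::size_t length, std::size_t copies) {
    ListWith<CountingAllocator<int>> current(length, 0);
    std::vector<ListWith<CountingAllocator<int>>> stored;
    stored.reserve(copies);
    const std::size_t before = g_allocations.load();
    for (std::size_t i = 0; i < copies; ++i) stored.push_back(current);
    return g_allocations.load() - before;
}
```

Each copy of a non-empty list needs at least one trip to the allocator, so with N entries in L the loop makes on the order of N * N allocations, all competing for the same heap when the default allocator is used.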

Thanks for your help, I will use @Dave's approach, with std::async working on chunks.

Answer 1

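If f(current, s, q) and copying current are trivially cheap, it will be hard to justify the overhead of multithreading. However, I think I would use a lock-free hash/unordered map (tbb::concurrent_hash_map? I don't know tbb) and launch the entire inner for loop with std::async. The idea is to launch a decent-sized chunk of work with each std::async; if the chunk is too small and you launch a million trivial tasks, the overhead of std::async will eclipse the work the task has to do!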

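Also note that when you use std::async you need to save the returned future somewhere, or it will block in the future's destructor until the task is done, buying you multithreading overhead and no parallel processing at all. You may be running into that right now. It's very obnoxious and I wish it didn't work that way.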
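The blocking-destructor behavior can be observed with a small sketch like the one below. The Overlap counter and the function names are hypothetical; the 20 ms sleep merely stands in for real work.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <future>
#include <thread>
#include <vector>

// Tracks how many tasks were running at the same time.
struct Overlap {
    std::atomic<int> running{0};
    std::atomic<int> peak{0};
};

void task(Overlap& o) {
    const int now = ++o.running;
    int seen = o.peak.load();
    // Raise the recorded peak if this task pushed concurrency higher.
    while (now > seen && !o.peak.compare_exchange_weak(seen, now)) {}
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    --o.running;
}

// The future dies at the end of each iteration (same effect as ignoring the
// return value): its destructor waits for the task, so tasks run one by one.
int peak_when_discarding(int n) {
    Overlap o;
    for (int i = 0; i < n; ++i) {
        auto dropped = std::async(std::launch::async, task, std::ref(o));
    }  // <- blocks here on every iteration
    return o.peak.load();
}

// The futures are stored, so all tasks are launched before any wait happens.
int peak_when_keeping(int n) {
    Overlap o;
    std::vector<std::future<void>> pending;
    for (int i = 0; i < n; ++i) {
        pending.push_back(std::async(std::launch::async, task, std::ref(o)));
    }
    for (auto& p : pending) p.get();
    return o.peak.load();
}
```

peak_when_discarding never sees more than one task running at a time, while peak_when_keeping lets the tasks overlap (how many actually overlap depends on the scheduler).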

Solution 1
3 Accepted 2015-03-12 17:35:18
