I need to gather some statistics for my project with the following algorithm (in Python):

    from collections import defaultdict
    from copy import copy

    stats = defaultdict(list)
    for s in L:
        current = []
        for q in L:
            stats[s, q] = copy(current)
            current = f(current, s, q)
Because the list L is large, f() and copying current take some time, and the project's main language is C++, I decided to implement the algorithm in C++ and use its multithreading capabilities.
I moved this part:

    stats[s, q] = copy(current)
    current = f(current, s, q)

into a separate std::async task and locked a std::mutex while inserting into stats, but that made things slower. I tried tbb::concurrent_unordered_map, but that made things worse.
I wrote a benchmark that reproduces this behavior: https://gist.github.com/myaut/94ee59d9752f524d3da8
Results on a 2 x Xeon E5-2420 machine running Debian 7, with 800 entries in L:
single-threaded 8157ms
async mutex 10145ms
async with chunks mutex 9618ms
threads mutex 9378ms
async tbb 10173ms
async with chunks tbb 9560ms
threads tbb 9418ms
I do not understand why TBB is the slowest (it seems tbb::concurrent_unordered_map allocates larger amounts of memory, but for what?). Are there any other options that could help me?
EDIT: I've updated my benchmark with the suggested approaches (and reduced N to 800). It seems the problem is somewhere else: each std::async task now handles a bundle of 20 sequential elements of the list.

EDIT2: I found one of the issues -- the app consumes a large amount of memory, so the Xen hypervisor became a bottleneck. After rebooting in native mode, the multithreaded modes are only a bit slower than the single-threaded one.
EDIT3: It seems the issue with multithreading is the huge number of allocations made when copying the list:
mprotect()
_int_malloc+0xcba/0x13f0
__libc_malloc+0x70/0x260
operator new(unsigned long)+0x1d/0x90
__gnu_cxx::new_allocator<int>::allocate(unsigned long, void const*)+0x40/0x42
std::_Vector_base<int, std::allocator<int> >::_M_allocate(unsigned long)+0x2f/0x38
std::_Vector_base<int, std::allocator<int> >::_M_create_storage(unsigned long)+0x23/0x58
std::_Vector_base<int, std::allocator<int> >::_Vector_base(unsigned long, std::allocator<int> const&)+0x3b/0x5e
std::vector<int, std::allocator<int> >::vector(std::vector<int, std::allocator<int> > const&)+0x55/0xf0
void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1}::operator()(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int) const+0x5f/0x1dc
_ZNSt12_Bind_simpleIFZ16threaded_processI31concurrent_map_of_list_of_listsEvRT_RKSt6vectorIiSaIiEEEUlN9__gnu_cxx17__normal_iteratorIPKiS6_EESD_iE_SD_SD_iEE9_M_invokeIJLm0ELm1ELm2EEEEvSt12_Index_tupleIJXspT_EEE+0x7c/0x87
std::_Bind_simple<void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1} (__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)>::operator()()+0x1b/0x28
std::thread::_Impl<std::_Bind_simple<void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1} (__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)> >::_M_run()+0x1c/0x1e
std::error_code::default_error_condition() const+0x40/0xc0
start_thread+0xd0/0x300
clone+0x6d/0x90
The thing is, when heap space is exhausted, libc calls grow_heap(), which usually adds only one page, but then calls mprotect(), which in turn calls validate_mm() in the kernel. validate_mm() seems to lock the entire VMA using a semaphore. I replaced the default list allocator with tbb::scalable_allocator, and it rocks! Now the tbb version is 2 times faster than the single-threaded approach.
Thanks for your help; I will use @Dave's approach with chunks of work in std::async.
If f(current, s, q) and copying current are trivially cheap, it's going to be hard to make going wide with multithreading worth the overhead. However, I think I would use a lock-free hash/unordered map (tbb::concurrent_hash_map? I don't know TBB) and launch the entire inner for loop with std::async. The idea is to launch a decent-sized chunk of work with std::async; if the tasks are too tiny and you launch a million trivial ones, the overhead of using std::async will eclipse the work each task has to do!
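The chunking idea can be sketched as follows. This is a minimal illustration, not the questioner's benchmark code: the stand-in f(), the plain std::map protected by a mutex (merged once per chunk rather than once per insert), and the chunk size of 20 are all assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for f(); the real f() is not shown in the question.
std::vector<int> f(std::vector<int> current, int s, int q) {
    current.push_back(s + q);
    return current;
}

using stats_map = std::map<std::pair<int, int>, std::vector<int>>;

stats_map process_chunked(const std::vector<int>& L, std::size_t chunk = 20) {
    stats_map stats;
    std::mutex stats_lock;
    std::vector<std::future<void>> futures;

    for (std::size_t begin = 0; begin < L.size(); begin += chunk) {
        std::size_t end = std::min(begin + chunk, L.size());
        // One async task per chunk of outer-loop elements, not per element:
        // the chunk amortizes the task-launch overhead.
        futures.push_back(std::async(std::launch::async, [&, begin, end] {
            stats_map local;  // fill a private map without holding the lock
            for (std::size_t i = begin; i != end; ++i) {
                std::vector<int> current;
                for (int q : L) {
                    local[{L[i], q}] = current;      // the copy happens here
                    current = f(std::move(current), L[i], q);
                }
            }
            std::lock_guard<std::mutex> guard(stats_lock);  // merge once per chunk
            for (auto& kv : local)
                stats[kv.first] = std::move(kv.second);
        }));
    }
    for (auto& fut : futures)
        fut.get();  // wait for all chunks before returning
    return stats;
}
```

Each task accumulates into a thread-local map and takes the mutex only once, so lock contention scales with the number of chunks rather than the number of inserts.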
Also note that when you use std::async you need to save the returned future somewhere, or the future's destructor blocks until the task is done, buying you multithreading overhead and no parallel processing at all. You may be running into that now. It's very obnoxious and I wish it didn't work that way.
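A minimal demonstration of that pitfall (the 50 ms sleep is a placeholder standing in for real work):

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Placeholder work item standing in for a real task.
void task() { std::this_thread::sleep_for(std::chrono::milliseconds(50)); }

void accidentally_serial() {
    for (int i = 0; i < 4; ++i) {
        auto fut = std::async(std::launch::async, task);
        // fut's destructor runs at the end of each iteration and blocks
        // until the task finishes -- the loop is effectively serial.
    }
}

void actually_parallel() {
    std::vector<std::future<void>> futures;
    for (int i = 0; i < 4; ++i)
        futures.push_back(std::async(std::launch::async, task));
    // All four tasks are now in flight; the futures' destructors
    // wait for them together at scope exit.
}
```

The first loop takes roughly four times as long as the second, despite both "launching" four asynchronous tasks.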