I need to gather some statistics for my project with the following algorithm (in Python):

    from collections import defaultdict
    from copy import copy

    stats = defaultdict(list)
    for s in L:
        current = []
        for q in L:
            stats[s, q] = copy(current)
            current = f(current, s, q)
Because the list L is large, f() and copying current take some time, and the project's main language is C++, I decided to implement the algorithm in C++ and use its multithreading capabilities.
I moved this part:

    stats[s, q] = copy(current)
    current = f(current, s, q)

into a separate std::async task and locked a std::mutex while inserting into stats, but that made things slower. I tried tbb::concurrent_unordered_map, but that made things worse.
I wrote a benchmark that reproduces this behavior: https://gist.github.com/myaut/94ee59d9752f524d3da8
Results on a 2 x Xeon E5-2420 machine running Debian 7, with 800 entries in L:
single-threaded 8157ms
async mutex 10145ms
async with chunks mutex 9618ms
threads mutex 9378ms
async tbb 10173ms
async with chunks tbb 9560ms
threads tbb 9418ms
I do not understand why TBB is the slowest (it seems tbb::concurrent_unordered_map allocates larger amounts of memory, but for what?). Are there any other options that could help me?
EDIT: I've updated my benchmark with the suggested approaches (and reduced N to 800). It seems the problem is somewhere else: each std::async task now handles a bundle of 20 sequential elements of the list.

EDIT2: I found one of the issues -- the app consumes a large amount of memory, so the Xen hypervisor became a bottleneck. After rebooting in native mode, the multithreaded modes are only a bit slower than the single-threaded one.
EDIT3: It seems the issue with multithreading is the huge number of allocations made when copying the list:
mprotect()
_int_malloc+0xcba/0x13f0
__libc_malloc+0x70/0x260
operator new(unsigned long)+0x1d/0x90
__gnu_cxx::new_allocator<int>::allocate(unsigned long, void const*)+0x40/0x42
std::_Vector_base<int, std::allocator<int> >::_M_allocate(unsigned long)+0x2f/0x38
std::_Vector_base<int, std::allocator<int> >::_M_create_storage(unsigned long)+0x23/0x58
std::_Vector_base<int, std::allocator<int> >::_Vector_base(unsigned long, std::allocator<int> const&)+0x3b/0x5e
std::vector<int, std::allocator<int> >::vector(std::vector<int, std::allocator<int> > const&)+0x55/0xf0
void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1}::operator()(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int) const+0x5f/0x1dc
_ZNSt12_Bind_simpleIFZ16threaded_processI31concurrent_map_of_list_of_listsEvRT_RKSt6vectorIiSaIiEEEUlN9__gnu_cxx17__normal_iteratorIPKiS6_EESD_iE_SD_SD_iEE9_M_invokeIJLm0ELm1ELm2EEEEvSt12_Index_tupleIJXspT_EEE+0x7c/0x87
std::_Bind_simple<void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1} (__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)>::operator()()+0x1b/0x28
std::thread::_Impl<std::_Bind_simple<void threaded_process<concurrent_map_of_list_of_lists>(concurrent_map_of_list_of_lists&, std::vector<int, std::allocator<int> > const&)::{lambda(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)#1} (__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int)> >::_M_run()+0x1c/0x1e
std::error_code::default_error_condition() const+0x40/0xc0
start_thread+0xd0/0x300
clone+0x6d/0x90
The thing is, when heap space is exhausted, libc calls grow_heap(), which usually adds only one page, but then calls mprotect(), which in turn calls validate_mm() in the kernel. validate_mm() seems to lock the entire VMA using a semaphore. I replaced the default list allocator with tbb::scalable_allocator, and it rocks! Now the tbb version is 2 times faster than the single-threaded approach.
Thanks for your help; I will use @Dave's approach with chunks of work in std::async.
If f(current, s, q) and copying current are trivially cheap, it's going to be hard to make going wide with multithreading worth the overhead. However, I think I would use a lock-free hash/unordered map (tbb::concurrent_hash_map? I don't know TBB) and launch the entire inner for loop with std::async. The idea is to launch a decent-sized chunk of work with std::async; if the tasks are too tiny and you launch a million trivial ones, the overhead of using std::async will eclipse the work each task has to do!
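The chunking idea can be sketched as follows. This is a minimal illustration, not the questioner's benchmark code: the stand-in f(), the plain std::map protected by a mutex (merged once per chunk rather than once per insert), and the chunk size of 20 are all assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for f(); the real f() is not shown in the question.
std::vector<int> f(std::vector<int> current, int s, int q) {
    current.push_back(s + q);
    return current;
}

using stats_map = std::map<std::pair<int, int>, std::vector<int>>;

stats_map process_chunked(const std::vector<int>& L, std::size_t chunk = 20) {
    stats_map stats;
    std::mutex stats_lock;
    std::vector<std::future<void>> futures;

    for (std::size_t begin = 0; begin < L.size(); begin += chunk) {
        std::size_t end = std::min(begin + chunk, L.size());
        // One async task per chunk of outer-loop elements, not per element:
        // the chunk amortizes the task-launch overhead.
        futures.push_back(std::async(std::launch::async, [&, begin, end] {
            stats_map local;  // fill a private map without holding the lock
            for (std::size_t i = begin; i != end; ++i) {
                std::vector<int> current;
                for (int q : L) {
                    local[{L[i], q}] = current;      // the copy happens here
                    current = f(std::move(current), L[i], q);
                }
            }
            std::lock_guard<std::mutex> guard(stats_lock);  // merge once per chunk
            for (auto& kv : local)
                stats[kv.first] = std::move(kv.second);
        }));
    }
    for (auto& fut : futures)
        fut.get();  // wait for all chunks before returning
    return stats;
}
```

Each task accumulates into a thread-local map and takes the mutex only once, so lock contention scales with the number of chunks rather than the number of inserts.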
Also note that when you use std::async you need to save the returned future somewhere, or the future's destructor blocks until the task is done, buying you multithreading overhead and no parallel processing at all. You may be running into that now. It's very obnoxious and I wish it didn't work that way.
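A minimal demonstration of that pitfall (the 50 ms sleep is a placeholder standing in for real work):

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Placeholder work item standing in for a real task.
void task() { std::this_thread::sleep_for(std::chrono::milliseconds(50)); }

void accidentally_serial() {
    for (int i = 0; i < 4; ++i) {
        auto fut = std::async(std::launch::async, task);
        // fut's destructor runs at the end of each iteration and blocks
        // until the task finishes -- the loop is effectively serial.
    }
}

void actually_parallel() {
    std::vector<std::future<void>> futures;
    for (int i = 0; i < 4; ++i)
        futures.push_back(std::async(std::launch::async, task));
    // All four tasks are now in flight; the futures' destructors
    // wait for them together at scope exit.
}
```

The first loop takes roughly four times as long as the second, despite both "launching" four asynchronous tasks.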