
How to synchronize threads for a recursive function with sub-threads

I am quite new to C++ and threading, and I have been stuck on this problem for days. It is supposed to form the base code for an FFT (fast Fourier transform). It is only a skeleton, so several things are still missing, such as the twiddle factors, and the inputs are doubles (not yet complex numbers).

I want to parallelize a function f_thread in C++. Here is working, compilable code:

#include<iostream>
#include<thread>
#include <vector>
#include <mutex>

void get_odd_elements(std::vector<double> inpt, std::vector<double> &out) {
    for (int i = 0; i < inpt.size()-1; i = i + 2) {out[i/2] = inpt[i];}
}

void get_even_elements(std::vector<double> inpt, std::vector<double> &out) {
    for (int i = 1; i < inpt.size(); i = i + 2) {out[i/2] = inpt[i];}
}

void attach(std::vector<double> a, std::vector<double> b, std::vector<double> &out) {
    for (int i = 0; i < a.size(); i++) {out[i] = a[i];}
    for (int i = a.size(); i < a.size()+b.size(); i++) {out[i] = b[i];}
}

void add_vectors(std::vector<double> &x, std::vector<double> &y, std::vector<double> &z) {for (int i = 0; i < x.size(); i++) {z[i] = x[i] + y[i];}}

void sub_vectors(std::vector<double> &x, std::vector<double> &y, std::vector<double> &z) {for (int i = 0; i < x.size(); i++) {z[i] = y[i] - x[i];}}

//the f_thread function

void f_thread(std::vector<double> in, std::vector<double> &out) {

    if (in.size() == 1) {out = in;}
    else {

        std::vector<double> f0(in.size()/2);
        std::vector<double> f1(in.size()/2);

        get_odd_elements(in,std::ref(f0)); //get_odd_elements is a function that gets all odd-indexed elements of f
        get_even_elements(in,std::ref(f1)); //get_even_elements is a function that gets all even-indexed elements of in

        std::vector<double> a(f0.size());
        std::vector<double> b(f1.size());

        std::mutex mtx1; std::mutex mtx2;

        std::thread t0(f_thread,std::ref(f0),std::ref(a)); //create thread for f_thread on a
        std::thread t1(f_thread,std::ref(f1),std::ref(b)); //create thread for f_thread on b

        t0.join(); t1.join(); // join 2 threads

        std::vector<double> a_out(f0.size());
        std::vector<double> b_out(f1.size());

        add_vectors(std::ref(a),std::ref(b),std::ref(a_out)); //call add_vectors function : a + b
        sub_vectors(std::ref(a),std::ref(b),std::ref(b_out)); //call sub_vectors function : b - a

        std::vector<double> f_out(in.size());
        attach(a_out,b_out,std::ref(f_out)); //attach is a function that appends b to the end of a
        out = f_out; 
    }
}

int main() {
    int n_elements = 16;
    std::vector<double> sample_input(n_elements);
    for (int i = 0; i < n_elements; i++) {sample_input[i] = i;}
    std::vector<double> output(n_elements);
    std::thread start(f_thread,std::ref(sample_input),std::ref(output));
    start.join();
    for (int i = 0; i < n_elements; i++) {std::cout << "output element "; std::cout << i; std::cout << ": "; std::cout << output[i]; std::cout<< "\n";}
    }

So f_thread is started as a thread and then creates 2 sub-threads that recursively call f_thread. I tried several tricks using mutexes, but none seem to work, since synchronization between the 2 sub-threads is not going well (it's a hotspot for race conditions). Here is one version that I tried and that did not work. I also tried using global recursive mutexes, but still saw no improvement.

#include<iostream>
#include<thread>
#include <vector>
#include <mutex>

void get_odd_elements(std::vector<double> inpt, std::vector<double> &out) {
    for (int i = 0; i < inpt.size()-1; i = i + 2) {out[i/2] = inpt[i];}
}

void get_even_elements(std::vector<double> inpt, std::vector<double> &out) {
    for (int i = 1; i < inpt.size(); i = i + 2) {out[i/2] = inpt[i];}
}

void attach(std::vector<double> a, std::vector<double> b, std::vector<double> &out) {
    for (int i = 0; i < a.size(); i++) {out[i] = a[i];}
    for (int i = a.size(); i < a.size()+b.size(); i++) {out[i] = b[i];}
}

void add_vectors(std::vector<double> &x, std::vector<double> &y, std::vector<double> &z) {for (int i = 0; i < x.size(); i++) {z[i] = x[i] + y[i];}}

void sub_vectors(std::vector<double> &x, std::vector<double> &y, std::vector<double> &z) {for (int i = 0; i < x.size(); i++) {z[i] = y[i] - x[i];}}

//the f_thread function

void f_thread(std::vector<double> in, std::vector<double> &out) {

    if (in.size() == 1) {out = in;}
    else {

        std::vector<double> f0(in.size()/2);
        std::vector<double> f1(in.size()/2);

        get_odd_elements(in,std::ref(f0)); //get_odd_elements is a function that gets all odd-indexed elements of f
        get_even_elements(in,std::ref(f1)); //get_even_elements is a function that gets all even-indexed elements of in

        std::vector<double> a(f0.size());
        std::vector<double> b(f1.size());

        std::mutex mtx1; std::mutex mtx2;

        mtx1.lock(); std::thread t0(f_thread,std::ref(f0),std::ref(a)); mtx1.unlock(); //create thread for f_thread on a
        mtx2.lock(); std::thread t1(f_thread,std::ref(f1),std::ref(b)); mtx2.unlock(); //create thread for f_thread on b

        t0.join(); t1.join(); // join 2 threads

        std::vector<double> a_out(f0.size());
        std::vector<double> b_out(f1.size());

        add_vectors(std::ref(a),std::ref(b),std::ref(a_out)); //call add_vectors function : a + b
        sub_vectors(std::ref(a),std::ref(b),std::ref(b_out)); //call sub_vectors function : b - a

        std::vector<double> f_out(in.size());
        attach(a_out,b_out,std::ref(f_out)); //attach is a function that appends b to the end of a
        out = f_out; 
    }
}

int main() {
    int n_elements = 16;
    std::vector<double> sample_input(n_elements);
    for (int i = 0; i < n_elements; i++) {sample_input[i] = i;}
    std::vector<double> output(n_elements);
    std::thread start(f_thread,std::ref(sample_input),std::ref(output));
    start.join();
    for (int i = 0; i < n_elements; i++) {std::cout << "output element "; std::cout << i; std::cout << ": "; std::cout << output[i]; std::cout<< "\n";}
    }

I verified that this code compiles using g++ f_thread.cpp -pthread with the standard C++ libraries on Linux (Ubuntu 18.04).

The code now runs (no more "Aborted (core dumped)" errors), but the output of the threaded version changes on each run, which suggests that synchronization is not working.

For reference, here is the sequential version of the code, which doesn't use sub-threads and works well (i.e. the output does not change from run to run):

// WORKING sequential version

#include<iostream>
#include<thread>
#include <vector>
#include <mutex>

void get_odd_elements(std::vector<double> inpt, std::vector<double> &out) {
    for (int i = 0; i < inpt.size()-1; i = i + 2) {out[i/2] = inpt[i];}
}

void get_even_elements(std::vector<double> inpt, std::vector<double> &out) {
    for (int i = 1; i < inpt.size(); i = i + 2) {out[i/2] = inpt[i];}
}

void attach(std::vector<double> a, std::vector<double> b, std::vector<double> &out) {
    for (int i = 0; i < a.size(); i++) {out[i] = a[i];}
    for (int i = a.size(); i < a.size()+b.size(); i++) {out[i] = b[i];}
}

void add_vectors(std::vector<double> &x, std::vector<double> &y, std::vector<double> &z) {for (int i = 0; i < x.size(); i++) {z[i] = x[i] + y[i];}}

void sub_vectors(std::vector<double> &x, std::vector<double> &y, std::vector<double> &z) {for (int i = 0; i < x.size(); i++) {z[i] = y[i] - x[i];}}

//the f_thread function

void f_thread(std::vector<double> in, std::vector<double> &out) {

    if (in.size() == 1) {out = in;}
    else {

        std::vector<double> f0(in.size()/2);
        std::vector<double> f1(in.size()/2);

        get_odd_elements(in,std::ref(f0)); //get_odd_elements is a function that gets all odd-indexed elements of f
        get_even_elements(in,std::ref(f1)); //get_even_elements is a function that gets all even-indexed elements of in

        std::vector<double> a(f0.size());
        std::vector<double> b(f1.size());

        f_thread(std::ref(f0),std::ref(a)); // no thread, just call recursion 

        f_thread(std::ref(f1),std::ref(b)); // no thread, just call recursion 

        std::vector<double> a_out(f0.size());
        std::vector<double> b_out(f1.size());

        add_vectors(std::ref(a),std::ref(b),std::ref(a_out)); //call add_vectors function : a + b
        sub_vectors(std::ref(a),std::ref(b),std::ref(b_out)); //call sub_vectors function : b - a

        std::vector<double> f_out(in.size());
        attach(a_out,b_out,std::ref(f_out)); //attach is a function that appends b to the end of a
        out = f_out; 
    }
}

int main() {
    int n_elements = 16;
    std::vector<double> sample_input(n_elements);
    for (int i = 0; i < n_elements; i++) {sample_input[i] = i;}
    std::vector<double> output(n_elements);
    std::thread start(f_thread,std::ref(sample_input),std::ref(output));
    start.join();
    for (int i = 0; i < n_elements; i++) {std::cout << "output element "; std::cout << i; std::cout << ": "; std::cout << output[i]; std::cout<< "\n";}
    }

The code is supposed to produce this same output every time it is run:

output element 0: 120
output element 1: 0
output element 2: 0
output element 3: 7.31217e-322
output element 4: 0
output element 5: 6.46188e-319
output element 6: 56
output element 7: 0
output element 8: 0
output element 9: 4.19956e-322
output element 10: 120
output element 11: 0
output element 12: 0
output element 13: 7.31217e-322
output element 14: 0
output element 15: 6.46188e-319

This is not a threading error but an out-of-bounds access to array elements in the function attach:

void attach(std::vector<double> a, std::vector<double> b, std::vector<double> &out) {
    for (int i = 0; i < a.size(); i++) {out[i] = a[i];}
    for (int i = a.size(); i < a.size()+b.size(); i++) {out[i] = b[i];}
}

In the second loop the index i starts from a.size(), not from 0, but you use it to access elements of b as if it started from 0. So every b[i] in that loop reads past the end of b.
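If you prefer to keep the explicit loops, a minimal corrected sketch could look like the following. It keeps the name attach from the question but, as an assumption on my part, takes the inputs by const reference and uses std::size_t for the indices; the only essential change is that b is indexed from 0 and the offset is applied to out.

```cpp
#include <cstddef>
#include <vector>

// Writes a followed by b into out.
// out must already hold a.size() + b.size() elements.
void attach(const std::vector<double>& a, const std::vector<double>& b,
            std::vector<double>& out) {
    for (std::size_t i = 0; i < a.size(); ++i) { out[i] = a[i]; }
    // Index b from 0; only the destination index is shifted by a.size().
    for (std::size_t i = 0; i < b.size(); ++i) { out[a.size() + i] = b[i]; }
}
```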

Instead of writing loops, you could use std::copy from <algorithm>:

void attach(std::vector<double> a, std::vector<double> b, std::vector<double> &out) {
    std::copy(a.begin(), a.end(), out.begin());
    std::copy(b.begin(), b.end(), out.begin()+a.size());
}

After that, for recursive threading you only need this:

std::thread t0(f_thread,std::ref(f0),std::ref(a)); //create thread for f_thread on a
std::thread t1(f_thread,std::ref(f1),std::ref(b)); //create thread for f_thread on b
t0.join(); t1.join(); // join 2 threads

There are no races, since each thread works with separate input and output arrays (which you created on the stack of the "parent" thread). The result is deterministic and the same for the sequential and threaded versions:

output element 0: 120
output element 1: 64
output element 2: 32
output element 3: 0
output element 4: 16
output element 5: 0
output element 6: 0
output element 7: 0
output element 8: 8
output element 9: 0
output element 10: 0
output element 11: 0
output element 12: 0
output element 13: 0
output element 14: 0
output element 15: 0

BTW, you could have guessed that even your serial version is incorrect: the input data are all integers and you only copy, add, and subtract them, so there is no reason for floating-point values like 7.31217e-322 to appear in the output.

Also please pay attention to Davis Herring's comments: you copy the data a lot between vectors. At the very least, I would pass the vectors to functions by const reference instead of by value (unless it is known that these copies are eliminated).
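As a rough illustration of the const-reference suggestion, the helpers might be written like this. The names even_indexed, odd_indexed and add_vectors_cref are hypothetical (they are not in the question's code), and returning the result by value instead of filling an output parameter is my own choice for the sketch.

```cpp
#include <cstddef>
#include <vector>

// Returns the elements at indices 0, 2, 4, ... of in. The input is not copied.
std::vector<double> even_indexed(const std::vector<double>& in) {
    std::vector<double> out;
    out.reserve((in.size() + 1) / 2);
    for (std::size_t i = 0; i < in.size(); i += 2) { out.push_back(in[i]); }
    return out;
}

// Returns the elements at indices 1, 3, 5, ... of in. The input is not copied.
std::vector<double> odd_indexed(const std::vector<double>& in) {
    std::vector<double> out;
    out.reserve(in.size() / 2);
    for (std::size_t i = 1; i < in.size(); i += 2) { out.push_back(in[i]); }
    return out;
}

// Element-wise x + y; assumes x and y have the same size.
std::vector<double> add_vectors_cref(const std::vector<double>& x,
                                     const std::vector<double>& y) {
    std::vector<double> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) { z[i] = x[i] + y[i]; }
    return z;
}
```

Note that std::thread copies its arguments, so a function taking a const reference still needs std::cref(v) at the call site if the copy is to be avoided.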

Finally, you should stop creating new threads much earlier than when your input arrays reach size 1. For real problem sizes, you might not be able to create thousands of threads; and even if you succeed, the overhead of creating and running that many threads will make your code run very slowly. Ideally, you should not create more threads than there are hardware cores on the machine where the code runs.
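One way to stop spawning early is to pass a depth budget down the recursion. The sketch below is not the question's code: f_limited and spawn_depth are names I made up, the even/odd split and the final combination are inlined, and it assumes the input size is a power of two. Only the top spawn_depth levels create a thread; below that the recursion is plain sequential calls.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Same recursion as f_thread, but only the top spawn_depth levels create threads.
// Assumes in.size() is a power of two (hypothetical helper, not from the question).
void f_limited(const std::vector<double>& in, std::vector<double>& out,
               unsigned spawn_depth) {
    if (in.size() <= 1) { out = in; return; }

    const std::size_t half = in.size() / 2;
    std::vector<double> f0(half), f1(half), a, b;
    for (std::size_t i = 0; i < half; ++i) {
        f0[i] = in[2 * i];      // indices 0, 2, 4, ...
        f1[i] = in[2 * i + 1];  // indices 1, 3, 5, ...
    }

    if (spawn_depth > 0) {
        // One half runs on a new thread, the other on the current thread.
        std::thread t(f_limited, std::cref(f0), std::ref(a), spawn_depth - 1);
        f_limited(f1, b, spawn_depth - 1);
        t.join();
    } else {
        // Budget used up: recurse sequentially.
        f_limited(f0, a, 0);
        f_limited(f1, b, 0);
    }

    out.resize(in.size());
    for (std::size_t i = 0; i < half; ++i) {
        out[i] = a[i] + b[i];         // first half: a + b
        out[half + i] = b[i] - a[i];  // second half: b - a
    }
}
```

With spawn_depth = d, up to 2^d threads work at the same time (counting the caller), so d can be chosen from the core count rather than from the input size.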

You should handle this by asking how many CPUs there are, then splitting your work up and using a queue to join it back together.

I don't know the FFT algorithm, but from a cursory look at your code, it seems you basically split your data up using a finer and finer toothed comb. Except you start at the finest level and work your way up, which isn't such a great way to split things up.

You don't want a different CPU handling every other value, because even on a single-chip multi-core CPU there are multiple L1 caches. Each L1 cache is shared with at most one other core. So you want all the values a single CPU deals with to be close to each other, to maximize the chance that a value you're looking for is already in the cache.

So you should start your splitting with the largest contiguous chunks. Because the FFT algorithm works based on powers of two, you should count the number of cores you have. Use thread::hardware_concurrency() to count. Then round up to the next power of two and split your problem into that number of sub-FFTs. Then combine their results in the main thread.
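The counting and rounding step could be sketched as follows. round_up_pow2 and sub_problem_count are hypothetical helper names. std::thread::hardware_concurrency() is only a hint and may return 0 when the value cannot be determined, so the sketch falls back to 1 in that case.

```cpp
#include <thread>

// Smallest power of two that is >= n (returns 1 for n == 0).
// Assumes n <= 2^31 so the shift cannot overflow a 32-bit unsigned.
unsigned round_up_pow2(unsigned n) {
    unsigned p = 1;
    while (p < n) { p <<= 1; }
    return p;
}

// Number of sub-problems to split the work into, based on the core count.
unsigned sub_problem_count() {
    unsigned cores = std::thread::hardware_concurrency();  // may be 0 if unknown
    if (cores == 0) { cores = 1; }
    return round_up_pow2(cores);
}
```

On a machine reporting 6 hardware threads, for example, this yields 8 sub-problems.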

I have a program I wrote that sort of does what you want. It splits a list up into a number of chunks and runs sort on each of them. Then it has a queue of merges that need to be done. Each chunk is handled by a separate thread, and each merge is also spawned out into its own thread.
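That program is not shown here, so the following is only my own rough sketch of the same general idea, not the author's code. The name chunked_sort is made up, std::async stands in for hand-managed threads, and for brevity the merges run one after another on the calling thread instead of each getting its own thread from a queue.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Sorts data by sorting `chunks` contiguous pieces concurrently,
// then merging neighbouring sorted runs pairwise until one run remains.
void chunked_sort(std::vector<int>& data, std::size_t chunks) {
    if (chunks == 0) { chunks = 1; }
    const std::size_t n = data.size();

    // bounds[k] is the start of chunk k; bounds[chunks] == n.
    std::vector<std::size_t> bounds(chunks + 1);
    for (std::size_t k = 0; k <= chunks; ++k) { bounds[k] = n * k / chunks; }
    auto at = [&](std::size_t k) {
        return data.begin() + static_cast<std::ptrdiff_t>(bounds[k]);
    };

    // Phase 1: each chunk is sorted by its own asynchronous task.
    // The chunks are disjoint ranges, so the tasks do not race.
    std::vector<std::future<void>> jobs;
    for (std::size_t k = 0; k < chunks; ++k) {
        jobs.push_back(std::async(std::launch::async, [&at, k] {
            std::sort(at(k), at(k + 1));
        }));
    }
    for (auto& job : jobs) { job.get(); }

    // Phase 2: merge runs of `width` chunks with their right-hand neighbour.
    for (std::size_t width = 1; width < chunks; width *= 2) {
        for (std::size_t k = 0; k + width < chunks; k += 2 * width) {
            const std::size_t last = std::min(k + 2 * width, chunks);
            std::inplace_merge(at(k), at(k + width), at(last));
        }
    }
}
```

The chunks are contiguous on purpose, which matches the cache-locality argument above: each task touches one compact region of the vector.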

I divide the number of cores by two because of a feature of modern CPUs that I'm not fond of, called hyperthreading. I could've just ignored that and it would've run fine, though since the main contention would've been over the integer ALU, it might've been a tad slower. (Hyperthreading shares resources within a single core.)

From the other answer it sounds like your FFT code has a few bugs. I would recommend getting it to work with just one thread first, then figuring out how to split it up.
