std::tuple faster than std::array?

Question

I have a 2 dimensional array with fixed number of columns=4, but the number of rows is very large. I found out using vector<tuple> instead of vector<array> or was nearly 2x faster, and 5x faster than vector<vector> . Though using a tuple is quite a headache. I defined a access function to get around it:

typedef tuple<int,int,int,int> Tuple;
// access the i-th element
inline int &tget (Tuple &t, int i) {
    switch (i) {
        case 0: return std::get<0>(t); 
        case 1: return std::get<1>(t); 
        case 2: return std::get<2>(t); 
        case 3: return std::get<3>(t); 
    }   
}
vector<Tuple> array; 
// this gives the element in i-th row and j-th column
tget(array[i],j);

I found this quite surprising and was wondering where this performance boost coming from, and if there is a neat solution with a syntax similar to vector<vector> or vector<array> using subscript operator[] with the same performance as tuple ?

UPDATE : I'm using gcc4.8 with c++11 standards and I use -O3 for optimization. Because people asked about the operations The operations are quite basic. First the 2d array is filled up with up to 1M rows. First I reserve the space and then use push_back function to add elements. Then I sort the array, and finally I do a binary search on it.

This is the code block I wrote. The input will be several test cases (=T) and each test case takes an integer (N up to 20), and then reads N more integers in lens . Then because of the exponential nature of the function the sums get quite big (4^(N/2) up to roughly a million). To test between array and tuple, toggle the typedef and comment/uncomment the return line in the access function. There are four lines/blocks that have to be commented/uncommented to switch back and forth between array and tuple , and the lines are designated in the comments ( // if array // if tuple comments) :

#include <iostream>
#include <algorithm>
#include <vector>
#include <tuple>
#include <array>
using namespace std;

typedef vector<int>::iterator VecIt;

// HERE you can switch between the two types 
typedef tuple<int,int,int,int> Tuple;       // define it as a tuple
//typedef array<int,4> Tuple;               // define it as array
inline int &tget (Tuple &t, int i) {
  //  return t[i];                          // if it's an array, uncomment This too

  // if it's an array comment the switch case                                                                                                                                                                                                                                                                                                                                                                                                                                               
    switch (i) {
        case 0: return std::get<0>(t);
        case 1: return std::get<1>(t);
        case 2: return std::get<2>(t);
        case 3: return std::get<3>(t);
        default:
          cerr << "tuple index out of bound" << endl;
    }
}

void comp_sums(VecIt v, VecIt vend, vector<Tuple> &sums){
    if (v==vend)
        return;
    int size = sums.size();
    for (int i=0; i<size; i++) {
        for (int j=1; j<4; j++){
            sums.push_back(sums[i]);
            tget(sums.back(),j) += *v;
        }
        tget(sums[i],0) += *v;
    }
    v++;
    comp_sums(v, vend, sums);
}

void docase( ) {
    int N;
    cin >> N;
    vector<int> lens(N);
    int lsum = 0;
    for (int i=0; i<N; i++) {
        cin >> lens[i];
        lsum += lens[i];
    }
    if (lsum % 4 != 0 ) {
        cout << 0 << endl;
        return;
    }
    lsum = lsum/4;

    vector<Tuple> sums1, sums2;
    Tuple z(0,0,0,0);                             // if it's a tuple
    //Tuple z = {0,0,0,0};                        // If it's an array 
    sums1.push_back(z); sums1.reserve(1<<N/2);
    sums2.push_back(z); sums2.reserve(1<<N/2);
    comp_sums(lens.begin()+1, lens.begin()+N/2, sums1);
    comp_sums(lens.begin()+N/2+1, lens.end(), sums2);
    sort(sums1.begin(), sums1.end());
    sort(sums2.begin(), sums2.end());
    long count = 0;
    for (int i=0; i<sums1.size(); i++) {
        for (int k=0; k<4; k++) {
            //Tuple t({lsum,lsum,lsum,lsum});     // if it's an array
            Tuple t(lsum,lsum,lsum,lsum);         // if it's a tuple
            for (int j=0; j<4; j++)
                tget(t,j) -= tget(sums1[i],(j+k)%4);
            tget(t,k) -= lens[0];
            tget(t,0) -= lens[N/2];
            bool neg = false;
            for (int j=0; j<4; j++)
                if (tget(t,j)<0){
                    neg = true;
                    break;
                }
            if (neg)
                continue;
            auto LB = lower_bound(sums2.begin(), sums2.end(), t);
            auto UB = upper_bound(LB, sums2.end(), t);
            count += (UB - LB);
        }
    }
    cout << count/6 << endl;
}


int main() {
    std::ios_base::sync_with_stdio(false);
    int T;
    cin >> T;
    while (T--) docase();
}

Here is also a test input I use. The first line says there are T=18 separate test cases. First line for each test case is N (up to 20), and then N integers follow. Here it is. The algorithm has exponential growth so the small numbers don't mean it's only a few operations. Also in measuring time I have accounted for the I/O:

18
8
132391 123854 21536 19482 133025 10945 121800 10311
8                                                                                                                                                             
12 4 12 4 4 12 12 24
8
131723 16253 23309 132227 125171 12253 16757 136227
8
8 4 4 8 4 8 12 24
8
12 12 12 8 4 8 12 28
8
115021 114654 112093 17443 20371 17810 17274 115190
8
8 8 4 4 12 4 12 20
8
12 4 4 4 4 12 12 28
8
8 12 8 12 8 4 12 28
8
4 12 8 12 8 8 4 24
20
34836 56869 24112 27796 198091 196472 50364 27364 29572 53701 52981 25779 202644 204884 53840 30056 48705 53092 43751 18419
20
207447 26867 41968 35977 36384 57042 40981 30191 47705 26907 248361 26003 53715 48237 27691 192772 60507 58194 20754 245913
20
34836 56869 24112 27796 198091 196472 50364 27364 29572 53701 52981 25779 202644 204884 53840 30056 48705 53092 43751 18419
20
207447 26867 41968 35977 36384 57042 40981 30191 47705 26907 248361 26003 53715 48237 27691 192772 60507 58194 20754 245913
20
4 2 2 4 4 2 4 2 4 2 2 4 4 3 3 4 2 3 4 1
20
2 3 2 2 3 2 3 4 3 2 4 4 4 3 2 3 4 4 3 3
20
4 4 4 4 3 3 3 2 2 4 2 4 2 2 4 2 2 2 2 1
20
4 4 2 3 4 3 4 3 2 4 4 2 2 4 3 4 3 3 2 4

Answer 1

Another question is similar to your question : Is it fine to replace an array with hierarchal structures and classes?

It has code and benchmark.

What is basically happening is that the tuple creates a structure which the compiler is able to better optimize operations on, than on an array of elements.

Answer 2

My guess is that the bottleneck is operator< of std::tuple vs. std::array . If you try with this user defined structure:

struct my_struct {
    int a, b, c, d;
};

inline bool operator<(my_struct const& lhs, my_struct const& rhs) {
    return std::tie(lhs.a, lhs.b, lhs.c, lhs.d) <
        std::tie(rhs.a, rhs.b, rhs.c, rhs.d);
}

This is about 5 times slower than std::tuple or std::array (tested on both my computer with MinGW and Wandbox).

Furthermore, the operator< assembly for std::tuple and std::array are different (with -O3 ), which may account for the performance difference between std::tuple and std::array .

fun(std::array<int, 4ul> const&, std::array<int, 4ul> const&):
  mov ecx, DWORD PTR [rdi]
  cmp DWORD PTR [rsi], ecx
  lea rdx, [rsi+16]
  lea rax, [rdi+16]
  jg .L32
  jl .L31
.L25:
  add rdi, 4
  add rsi, 4
  cmp rax, rdi
  je .L36
  mov ecx, DWORD PTR [rsi]
  cmp DWORD PTR [rdi], ecx
  jl .L32
  jle .L25
.L31:
  xor eax, eax
  ret
.L32:
  mov eax, 1
  ret
.L36:
  cmp rdx, rsi
  setne al
  ret

fun(std::tuple<int, int, int, int> const&, std::tuple<int, int, int, int> const&):
  mov edx, DWORD PTR [rsi+12]
  cmp DWORD PTR [rdi+12], edx
  mov eax, 1
  jl .L11
  mov eax, 0
  jg .L11
  mov ecx, DWORD PTR [rsi+8]
  cmp DWORD PTR [rdi+8], ecx
  mov eax, 1
  jl .L11
  mov eax, 0
  jg .L11
  mov ecx, DWORD PTR [rsi+4]
  cmp DWORD PTR [rdi+4], ecx
  mov eax, 1
  jl .L11
  mov eax, 0
  jg .L11
  mov eax, DWORD PTR [rsi]
  cmp DWORD PTR [rdi], eax
  setl al
.L11:
  rep ret

std::tuple faster than std::array?

Question

2 answers

solution1
1 2018-01-25 17:49:36

solution2
1 ACCPTED 2018-01-25 18:25:04

std::tuple faster than std::array?

Question

2 answers

solution1 1 2018-01-25 17:49:36

solution2 1 ACCPTED 2018-01-25 18:25:04

solution1
1 2018-01-25 17:49:36

solution2
1 ACCPTED 2018-01-25 18:25:04