
How is vector<vector<int>> "heavier" than vector<pair<int,int>>?

During a recent interview, I suggested using vector<pair<int,int>> over vector<vector<int>> since we only wanted to store two values for every entry in the vector. I said something to the tune of "we should use vector<pair<int,int>> over vector<vector<int>> since the latter is heavier than the former".

After the coding session was over, they said it was a good idea to use a pair over a vector and asked me to elaborate on what I meant by "heavier" earlier. Unfortunately, I wasn't able to. Yes, I know we can store only two values in a pair but many more in a vector, and that a vector gets resized automatically when its size == capacity, etc. But how should I have answered their question: why specifically was using vector<pair<int,int>> better than vector<vector<int>>? What extra things are done in the latter case?

Each vector is a single contiguous area of memory, dynamically allocated.

Let's say that you have 1000 values you'll be working with.

std::vector<std::pair<int, int>>

This gets you a single, contiguous block of memory, for 2000 integers.

std::vector<std::vector<int>>

This gets you a single contiguous block of memory for 1000 vectors.

Each one of those 1000 std::vectors gets you another contiguous block of memory for just two integers.

So, instead of one single contiguous block of memory, for this data structure, it will consist of 1001 blocks of memory scattered all over. You have no guarantees, whatsoever, that all those blocks of memory will be contiguous, one after another.

Each dynamic memory allocation comes at a cost. The cost is fairly small but it adds up very, very quickly. A single penny is easily ignored. A thousand pennies should be enough to get you a cup of coffee at Starbucks.

Furthermore, modern CPUs are very, very good at accessing contiguous blocks of memory. Iterating over a single contiguous block of memory to add up two thousand ints will be much, much faster than doing the same over a thousand disjointed sections of memory.
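The "1 block versus 1001 blocks" claim can be checked by counting allocations. The following is a minimal sketch, not a benchmark: CountingAllocator, g_allocations and the two helper functions are hypothetical names introduced here for illustration, and the exact counts assume a mainstream standard library in a non-debug configuration (some debug modes allocate extra bookkeeping objects).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical counter: incremented on every allocate() call below.
static std::size_t g_allocations = 0;

// Minimal allocator that forwards to std::allocator and counts allocations.
template <class T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        ++g_allocations;
        return std::allocator<T>{}.allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>{}.deallocate(p, n);
    }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

// n entries stored as pairs: one buffer for everything.
std::size_t allocations_for_pairs(std::size_t n) {
    using Pair = std::pair<int, int>;
    g_allocations = 0;
    std::vector<Pair, CountingAllocator<Pair>> v(n);
    assert(v.size() == n);
    return g_allocations;
}

// n entries stored as two-element vectors: one outer buffer plus one per entry.
std::size_t allocations_for_nested(std::size_t n) {
    using Inner = std::vector<int, CountingAllocator<int>>;
    g_allocations = 0;
    std::vector<Inner, CountingAllocator<Inner>> v;
    v.reserve(n);                                            // outer buffer
    for (std::size_t i = 0; i < n; ++i) v.emplace_back(2);   // inner buffer of two ints
    assert(v.size() == n);
    return g_allocations;
}
```

Calling allocations_for_pairs(1000) and allocations_for_nested(1000) from a main function should report 1 and 1001 allocations respectively under the assumptions above.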

You can answer this without reference to any particular language. The problem called for storing a sequence of 2-tuples. Your chosen type should be capable of storing 2-tuples, of course, but also be incapable of storing tuples of other sizes. So given two types that are both capable of storing the desired values, prefer the one that is less capable of storing undesired values.

vector<int> would allow you to store 2-element vectors, but also empty vectors, singleton vectors, 3-element vectors, 4-element vectors, etc. pair<int,int> is more precise, since it can only store exactly two values.

(Not to discount the performance benefits mentioned in the accepted answer, only to provide a purely semantic argument for using precise types.)
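To make the semantic argument concrete, here is a small sketch (the function names sum_pairs, rows_are_pairs and sum_rows are made up for this illustration). With pairs, "exactly two values" is enforced by the type; with nested vectors it becomes a run-time precondition that every consumer has to check or trust.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Every element is exactly two ints; there is nothing to validate.
int sum_pairs(const std::vector<std::pair<int, int>>& v) {
    int total = 0;
    for (const auto& p : v) total += p.first + p.second;
    return total;
}

// Returns true only if every row holds exactly two ints.
bool rows_are_pairs(const std::vector<std::vector<int>>& v) {
    for (const auto& row : v)
        if (row.size() != 2) return false;
    return true;
}

// Same sum over nested vectors: the type cannot rule out malformed rows,
// so the "exactly two" rule is checked at run time instead.
int sum_rows(const std::vector<std::vector<int>>& v) {
    assert(rows_are_pairs(v));
    int total = 0;
    for (const auto& row : v) total += row[0] + row[1];
    return total;
}
```

A value such as { {1, 2}, {3}, {} } is a perfectly valid vector<vector<int>>, yet it violates the intended shape; the pair-based version cannot even express it.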

To simplify the explanation, let's say that

  • A[ a | b ] B[ c ] means: a and b are in chunk A and c is in chunk B.
  • Chunks here are contiguous pieces of memory, so a is next to b

With that in mind, let's see an example: the memory usage of { { 1, 1 }, { 2, 2 },... }

For std::vector<std::vector<int>>

  • A[ int size | ptr to B ]
  • B[ [ int size | ptr to C ] | [ int size | ptr to D ] |... ]
  • C[ 1 | 1 ]
  • D[ 2 | 2 ]

For std::vector<std::pair<int, int>>

  • A[ int size | ptr to B ]
  • B[ [ 1 | 1 ] | [ 2 | 2 ] |... ]

I think the example is very clear: there is one less layer of indirection with std::vector<std::pair<int, int>>. Meaning:

  1. There is less memory consumption (you don't need an extra integer for the size and a pointer to a chunk for each element).
  2. Fewer steps are needed to get a desired value (otherwise, you would first have to load and read the pointer, and then load the desired value from that address).
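The memory side of this can be estimated with sizeof. This is a rough sketch under stated assumptions: the function names are hypothetical, the figures ignore whatever bookkeeping the heap allocator adds per block, and sizeof(std::vector<int>) is implementation-defined (three pointers, i.e. 24 bytes, is typical on 64-bit platforms rather than the simplified "size plus pointer" of the diagram).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// n entries stored as pairs: the buffer holds only the ints themselves
// (a pair of ints has no padding on mainstream ABIs).
std::size_t flat_bytes(std::size_t n) {
    return n * sizeof(std::pair<int, int>);
}

// n entries stored as inner vectors: each entry costs one vector object in
// the outer buffer plus its own two-int buffer somewhere else on the heap.
std::size_t nested_bytes(std::size_t n) {
    return n * (sizeof(std::vector<int>) + 2 * sizeof(int));
}

// Extra bytes paid for the per-entry vector objects.
std::size_t overhead_bytes(std::size_t n) {
    assert(nested_bytes(n) >= flat_bytes(n));
    return nested_bytes(n) - flat_bytes(n);
}
```

With 4-byte ints and 24-byte vector objects, 1000 entries would take about 8 KB as pairs and about 32 KB as nested vectors, before counting allocator overhead.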

As others mentioned, std::vector<int> adds extra bookkeeping, for example a counter for the number of elements.

But an interesting alternative you could have suggested in the interview is std::array<int, 2>. It should cost about the same as std::pair<int, int>, since it stores the numbers in a fixed-size array. One advantage is the API, which lets you write a[0] instead of a.first, and it is also easier to generalize if, for example, you later need to store three values per entry after a new feature is added.
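A short sketch of the std::array variant follows. The alias Entry and the helper functions are names invented for this example; the point is only that the element count lives in one place and the values are reached by index.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Changing the 2 to a 3 is all it takes to store three values per entry.
using Entry = std::array<int, 2>;

// Builds a tiny example container; entries are stored inline in one buffer.
std::vector<Entry> example_entries() {
    std::vector<Entry> v;
    v.push_back(Entry{1, 2});
    v.push_back(Entry{3, 4});
    assert(v.front()[0] == 1);  // index access instead of .first
    return v;
}

// Sums all values without naming .first/.second, so it keeps working
// unchanged if Entry grows to three or more ints.
int sum_entries(const std::vector<Entry>& v) {
    int total = 0;
    for (const Entry& e : v)
        for (int x : e) total += x;
    return total;
}
```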

A vector is a dynamically-resizing array. You sacrifice some performance to get the ability to resize dynamically.

A vector of vectors ( vector<vector<int>> ) has that performance overhead for both the outer vector and each of its elements. With a vector of pairs ( vector<pair<int, int>> ), you don't have the latter. A pair is always of a fixed size, so you needn't worry about having to resize it as needed (and relocate it to another position in memory if needed).

My "simple" / "naive" answer would be:

A vector<pair<int, int>> knows that every element is exactly one pair of ints, so it can allocate memory accordingly (e.g. when the vector resizes), in one contiguous chunk. It also only needs to keep track of the fact that it stores X pairs of ints, which enables fast access to those ints and keeps overhead to a minimum. Finally, with that information available at compile time, the compiler can (possibly) optimize the code better.

A vector<vector<int>> needs to be able to store X inner vectors, each holding any number of ints. The outer vector stores only the bookkeeping of each inner vector (including the address of its separately allocated buffer), which means your data is likely to be scattered all over memory. The inner vectors also need to keep track of how many ints they contain (even though this number should always be two), adding unnecessary overhead to both storing and accessing the ints. Finally, the compiler can make fewer assumptions about the structure of your data, reducing the potential for optimizations.

You can use a pair if you need any of its member functions or operators. Otherwise, a simple struct could be even lighter:

struct payload {
    int a {};
    int b {};
};

std::vector<payload> x { {1, 2}, {3, 4} };

When using the STL, it can be easy to forget that we can still use primitives and they're often more efficient.
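One way to see what "lighter" can mean here is to check the struct's properties at compile time. This is a sketch under assumptions: sum_payloads and example_total are hypothetical helpers, and the size equality is what mainstream ABIs give rather than something the standard promises. Whether std::pair<int, int> itself is trivially copyable depends on the standard library, which is why only the struct is asserted.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>
#include <vector>

struct payload {
    int a {};
    int b {};
};

// The plain struct can be copied as raw bytes.
static_assert(std::is_trivially_copyable<payload>::value,
              "payload is trivially copyable");
// Same footprint as the pair on mainstream ABIs (not guaranteed by the standard).
static_assert(sizeof(payload) == sizeof(std::pair<int, int>),
              "payload and pair<int, int> have the same size");

// Sums both members of every element.
int sum_payloads(const std::vector<payload>& v) {
    int total = 0;
    for (const payload& p : v) total += p.a + p.b;
    return total;
}

// Uses the same initializer as the snippet above.
int example_total() {
    std::vector<payload> x { {1, 2}, {3, 4} };
    assert(x.size() == 2);
    return sum_payloads(x);
}
```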

Lean is beautiful: A std::pair<int, int> corresponds to exactly two integers. And that's exactly what you wanted: not more, not less.

And it's performant: There is no overhead; the C++ standard defines the pair as a simple struct. So there is no memory management overhead, and access to the members is direct, since everything that could be time consuming is prepared at compile time.

Here is an example that initialises a pair<int,int> and calls a function with a reference to it:

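#include <utility>

void f(std::pair<int, int>& p);  // defined elsewhere; only the call site matters here

void test1(int a, int b) {
    auto x = std::make_pair(a, b);
    f(x);
}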

And here is the code generated by gcc with optimizations enabled:

    sub     rsp, 24
    mov     DWORD PTR [rsp+8], edi
    lea     rdi, [rsp+8]
    mov     DWORD PTR [rsp+12], esi
    call    f(std::pair<int, int>&)
    add     rsp, 24
    ret

In comparison, doing the same with vector<int> generates 31 lines of assembler because of the dynamic allocation, but also the need to cope with allocation errors, and of course a more complex destruction when the vector is no longer needed. See here for the full details.

(To complete the picture, some algorithms can take advantage of this simplicity and offer a pair specialization)

