Optimizing time performance of unordered_map in C++

I'm stuck on an optimization problem. I have a huge database (about 16M entries) which represents ratings given by different users to different items. From this database I have to evaluate a correlation measure between different users (i.e., I have to implement a similarity matrix). Fortunately this correlation matrix is symmetric, so I only have to calculate half of it.

Let me focus, for example, on the first column of the matrix: there are 135k users in total, so I keep one user fixed and find all the items rated in common between this user and each of the others (with a for loop). The timing problem appears even if I compare the single user against only 20k other users instead of 135k.

My approach is the following: first I query the DB to obtain, for example, all the data of the first 20k users (this takes time even with indexes in place, but it doesn't bother me since I do it only once), and I store everything in an unordered_map using the userID as key; the value for each userID is another unordered_map which stores all the ratings given by that user, this time keyed by itemID.

Then, in order to find the set of items that both users have rated, I iterate over the user who has rated fewer items, checking whether the other one has also rated the same items. The fastest data structures I know are hash maps, but for a single complete column my algorithm takes 30 s (just for 20k entries), which translates into WEEKS for the complete matrix.
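
For reference, the similarity I am computing is the Pearson correlation between two users' ratings over the set I of items they have both rated (this is exactly what the newnum/newden variables in the code below accumulate), where \bar{r}_u is user u's average rating from the averages table:

\text{sim}(u,v) = \frac{\sum_{i \in I}(r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I}(r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in I}(r_{v,i} - \bar{r}_v)^2}}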

The code is the following:

void similarity_matrix(sqlite3 *db, sqlite3 *db_avg, sqlite3 *similarity, long int tot_users, long int interval) {

    long int n = 1;
    double sim;
    string temp_s;
    vector<string> insert_query;
    sqlite3_stmt *stmt;

    std::cout << "Starting creating similarity matrix..." << std::endl;

    string query_string = "SELECT * from usersratings where usersratings.user <= 20000;";
    unordered_map<int, unordered_map<int, int>> users_map = db_query(query_string.c_str(), db);
    std::cout << "Query time: " << duration_ << " s." << std::endl;

    unordered_map<int, int> u1_map = users_map[1];

    string select_avg = "SELECT * from averages;";
    unordered_map<int, double> avg_map = avg_value(select_avg.c_str(), db_avg);

    for (int i = 2; i <= tot_users; i++)
    {
        unordered_map<int, int> user;
        int compare_id;

        if (users_map[i].size() <= u1_map.size()) {
            user = users_map[i];
            compare_id = 1;
        }
        else {
            user = u1_map;
            compare_id = i;
        }

        int matches = 0;
        double newnum = 0;
        double newden1 = 0;
        double newden2 = 0;

        unordered_map<int, int> item_map = users_map[compare_id];
        for (unordered_map<int, int>::iterator it = user.begin(); it != user.end(); ++it)
        {
            if (item_map.size() != 0) {
                int rating = item_map[it->first];
                if (rating != 0) {
                    double diff1 = (it->second - avg_map[1]);
                    double diff2 = (rating - avg_map[i]);
                    newnum += diff1 * diff2;
                    newden1 += pow(diff1, 2);
                    newden2 += pow(diff2, 2);
                }
            }
        }
        sim = newnum / (sqrt(newden1) * sqrt(newden2));
    }

    std::cout << "Execution time for first column: " << duration << " s." << std::endl;
    std::cout << "First column finished..." << std::endl;
}

This sticks out to me as an immediate potential performance trap:

unordered_map<int, unordered_map<int, int>> users_map = db_query(query_string.c_str(), db);

If the size of each sub-map for each user is anywhere close to the number of users, then you have a quadratic-complexity algorithm which is going to get dramatically slower the more users you have.

unordered_map does offer constant-time search, but it's still a search. The number of instructions required to do it is going to dwarf, say, the cost of indexing an array, especially if there are many collisions, which implies inner loops each time you search the map. It also isn't necessarily represented in a way that allows the fastest sequential iteration. So if you can just use std::vector for at least the sub-lists and avg_map, like so, that should help a lot for starters:

typedef pair<int, int> ItemRating;
typedef vector<ItemRating> ItemRatings;
unordered_map<int, ItemRatings> users_map = ...;
vector<double> avg_map = ...;
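
As a minimal sketch of that conversion (assuming your existing users_map of nested unordered_maps as the input, and a hypothetical helper name flatten), each user's inner hash map can be flattened into a contiguous vector once, up front:

#include <unordered_map>
#include <utility>
#include <vector>
using namespace std;

typedef pair<int, int> ItemRating;
typedef vector<ItemRating> ItemRatings;

// Flatten each user's inner hash map into a contiguous vector of
// (item, rating) pairs. This single O(n) pass is cheap compared to
// the repeated hash lookups it replaces in the correlation loop.
unordered_map<int, ItemRatings> flatten(
    const unordered_map<int, unordered_map<int, int>>& users_map)
{
    unordered_map<int, ItemRatings> result;
    result.reserve(users_map.size());
    for (const auto& user : users_map) {
        ItemRatings ratings;
        ratings.reserve(user.second.size());
        for (const auto& item : user.second)
            ratings.push_back(item);
        result[user.first] = move(ratings);
    }
    return result;
}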

Even the outer users_map could be a vector, unless it's sparse and not all indices are used. If it is sparse but the range of user IDs still fits into a reasonable range (not an astronomically large integer), you could construct two vectors. One stores the user data and has a size proportional to the number of users. The other is proportional to the valid range of user IDs and stores nothing but indices into the first vector, so you can translate a user ID to an index with a simple array lookup whenever you need to access user data through a user ID.

// User data array.
vector<ItemRatings> user_data(num_users);

// Array that translates sparse user ID integers to indices into the 
// above dense array. A value of -1 indicates that a user ID is not used.
// To fetch user data for a particular user ID, we do: 
// const ItemRatings& ratings = user_data[user_id_to_index[user_id]];
vector<int> user_id_to_index(biggest_user_index+1, -1);
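
Populating the translation array takes one pass over the user IDs returned by the query. A minimal sketch, assuming the distinct IDs are available in a vector<int> user_ids (a hypothetical name) and biggest_user_id is the largest of them:

#include <vector>
using namespace std;

vector<int> build_id_to_index(const vector<int>& user_ids, int biggest_user_id)
{
    // -1 marks user IDs that don't exist in the database.
    vector<int> user_id_to_index(biggest_user_id + 1, -1);
    // The i-th distinct user ID maps to dense index i in the data array.
    for (size_t i = 0; i < user_ids.size(); ++i)
        user_id_to_index[user_ids[i]] = static_cast<int>(i);
    return user_id_to_index;
}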

You're also needlessly copying those unordered_maps around on each iteration of the outer loop. While I don't think that's the source of the biggest bottleneck, it would help to avoid deep-copying data structures you don't even modify by using references or pointers:

// Shallow copy, don't deep copy big stuff needlessly.
const unordered_map<int, int>& user = users_map[i].size() <= u1_map.size() ?
                                      users_map[i] : u1_map;
const int compare_id = users_map[i].size() <= u1_map.size() ? 1 : i;
const unordered_map<int, int>& item_map = users_map[compare_id];
...

You also don't need to check whether item_map is empty inside the inner loop; that check should be hoisted outside. It's a micro-optimization that is unlikely to help much on its own, but it still eliminates blatant waste.
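
Hoisted out of the loop, that check would look something like this (sketched against the reference-based version above):

const unordered_map<int, int>& item_map = users_map[compare_id];
// Test emptiness once instead of on every inner-loop iteration.
if (!item_map.empty()) {
    for (unordered_map<int, int>::const_iterator it = user.begin(); it != user.end(); ++it) {
        // ... accumulate newnum, newden1, newden2 as before ...
    }
}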

The final code after this first pass would be something like this:

vector<ItemRatings> user_data = ...;
vector<double> avg_map = ...;

// Fill `rating_values` with the ratings from the first user.
// A value of 0 means the first user has not rated that item.
vector<int> rating_values(item_range, 0);
const ItemRatings& ratings1 = user_data[0];
for (auto it = ratings1.begin(); it != ratings1.end(); ++it)
{
    const int item = it->first;
    const int rating = it->second;
    rating_values[item] = rating;
}

// For each user starting from the second user:
for (int i = 1; i < tot_users; ++i)
{
    double newnum = 0;
    double newden1 = 0;
    double newden2 = 0;

    const ItemRatings& ratings2 = user_data[i];
    for (auto it = ratings2.begin(); it != ratings2.end(); ++it)
    {
        const int item = it->first;
        const int rating1 = rating_values[item];
        if (rating1 != 0) {
            const int rating2 = it->second;
            double diff1 = rating1 - avg_map[0];
            double diff2 = rating2 - avg_map[i];
            newnum += diff1 * diff2;
            newden1 += pow(diff1, 2);
            newden2 += pow(diff2, 2);
        }
    }
    sim = newnum / (sqrt(newden1) * sqrt(newden2));
}

The biggest difference in the code above is that we eliminated all searches through unordered_map and replaced them with simple indexed access into arrays. We also eliminated a lot of redundant copying of data structures.
