How to reduce time complexity under c++ with nested loops and regex?

Question

I have the following function.

It takes a vector of user names, a vector of strings (lines), and the number of top users to return.

First I count the number of occurrences of each user in the strings. If a user occurs several times in one string, it still counts as 1.

Then I sort by the number of occurrences. If the counts are equal, I sort the user names alphabetically.

The function returns the top N users with the most occurrences.

std::vector<std::string> GetTopUsers(const std::vector<std::string>& users,
    const std::vector<std::string>& lines, const int topUsersNum) {
    std::vector<std::pair<std::string, int>> userOccurancies;

    //count user occurancies
    for (const auto & user : users) {
        int count = 0;
        for (const auto &line : lines) {
            std::regex rgx("\\b" + user + "\\b", std::regex::icase);
            std::smatch match;
            if (std::regex_search(line, match, rgx)) {
                ++count;
                auto userIter = std::find_if(userOccurancies.begin(), userOccurancies.end(),
                    [&user](const std::pair<std::string, int>& element) { return element.first == user; });
                if (userIter == userOccurancies.end()) {
                    userOccurancies.push_back(std::make_pair(user, count));
                }
                else {
                    userIter->second = count;
                }
            }
        }
    }

    //sort by amount of occurancies, if occurancies are equal - sort alphabetically
    std::sort(userOccurancies.begin(), userOccurancies.end(),
        [](const std::pair<std::string, int>& p1, const std::pair<std::string, int>& p2)
    { return (p1.second > p2.second) ? true : (p1.second == p2.second ? p1.first < p2.first : false); });

    //extract top N users
    int topUsersSz = (topUsersNum <= userOccurancies.size() ? topUsersNum : userOccurancies.size());
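    std::vector<std::string> topUsers;
    topUsers.reserve(topUsersSz); // reserve only: constructing with a size would leave topUsersSz empty strings in front of the pushed names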
    for (int i = 0; i < topUsersSz; i++) {
        topUsers.push_back(userOccurancies[i].first);
    }

    return topUsers;
}

So for the input

    std::vector<std::string> users = { "john", "atest", "qwe" };
    std::vector<std::string> lines = { "atest john", "Qwe", "qwe1", "qwe," };

    int topUsersNum = 4;

output will be qwe atest john

But it looks very expensive: the two nested loops are already O(n^2), and there is a regex inside them, so it must be O(n^3) or even more.

Could you please advise me how to reduce the complexity in C++11?

I would also appreciate any advice about the code itself.

Or maybe there is a better board for questions about complexity and performance?

Thank you.

Update

std::vector<std::string> GetTopUsers2(const std::vector<std::string>& users,
    const std::vector<std::string>& lines, const size_t topUsersNum) {
    std::vector<std::pair<std::string, int>> userOccurancies(users.size());

    auto userOcIt = userOccurancies.begin();
    for (const auto & user : users) {
        userOcIt->first = std::move(user);
        userOcIt->second = 0;
        userOcIt++;
    }

    //count user occurancies
    for (auto &user: userOccurancies) {
        int count = 0;
        std::regex rgx("\\b" + user.first + "\\b", std::regex::icase);
        std::smatch match;
        for (const auto &line : lines) {
            if (std::regex_search(line, match, rgx)) {
                ++count;
                user.second = count;
            }
        }
    }

    //sort by amount of occurancies, if occurancies are equal - sort alphabetically
    std::sort(userOccurancies.begin(), userOccurancies.end(),
        [](const std::pair<std::string, int>& p1, const std::pair<std::string, int>& p2)
    { return (p1.second > p2.second) ? true : (p1.second == p2.second ? p1.first < p2.first : false); });

    //extract top N users
    auto middle = userOccurancies.begin() + std::min(topUsersNum, userOccurancies.size());
    int topUsersSz = (topUsersNum <= userOccurancies.size() ? topUsersNum : userOccurancies.size());
    std::vector<std::string> topUsers(topUsersSz);
    auto topIter = topUsers.begin();
    for (auto iter = userOccurancies.begin(); iter != middle; iter++) {
        *topIter = std::move(iter->first);
        topIter++;
    }

    return topUsers;
}

Thanks to @Jarod42, I updated the first part. I think that allocating the vector's memory once in the constructor is faster than calling emplace_back every time, so I used that. If I am wrong, please correct me.

Also, I use C++11, not C++17.
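
For comparison, here is a minimal sketch of the reserve + emplace_back variant of that first part. The helper name MakeCounters is made up for illustration and is not from the original code. After reserve, emplace_back does not reallocate, and each pair is constructed once in place instead of being default-constructed and then assigned. Note also that std::move on a const reference, as in the loop over users above, still results in a copy. Whether either variant is measurably faster is something only a benchmark can tell.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original post): builds the
// (user name, count) table with one allocation and no default-constructed pairs.
std::vector<std::pair<std::string, int>> MakeCounters(const std::vector<std::string>& users) {
    std::vector<std::pair<std::string, int>> counters;
    counters.reserve(users.size());      // allocate once, up front
    for (const auto& user : users) {
        // user is const, so the name is copied here; moving is only possible
        // if the function takes the vector by value or by rvalue reference.
        counters.emplace_back(user, 0);  // constructed in place, no reallocation
    }
    return counters;
}
```

The input vector is left untouched, and every counter starts at 0.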

time results:

Old: 3539400.00000 nanoseconds
New: 2674000.00000 nanoseconds

It is better but still looks complex, isn't it?

Answer 1

Constructing a regex is costly, and it can be moved out of the inner loop.

You can also move the strings instead of copying them.

You don't need to sort the whole range; std::partial_sort is enough.

And, more importantly, you can avoid the inner find_if.

std::vector<std::string>
GetTopUsers(
    std::vector<std::string> users,
    const std::vector<std::string>& lines,
    int topUsersNum)
{
    std::vector<std::pair<std::string, std::size_t>> userCount;
    userCount.reserve(users.size());

    for (auto& user : users) {
        userCount.emplace_back(std::move(user), 0);
    }

    for (auto& [user, count] : userCount) {
        std::regex rgx("\\b" + user + "\\b", std::regex::icase);
        for (const auto &line : lines) {
            std::smatch match;
            if (std::regex_search(line, match, rgx)) {
                ++count;
            }
        }
    }

    //sort by amount of occurancies, if occurancies are equal - sort alphabetically
    auto middle = userCount.begin() + std::min<std::size_t>(topUsersNum, userCount.size());
    std::partial_sort(userCount.begin(),
                      middle,
                      userCount.end(),
                      [](const auto& lhs, const auto& rhs)
        {
            return std::tie(rhs.second, lhs.first) < std::tie(lhs.second, rhs.first);
        });

    //extract top N users
    std::vector<std::string> topUsers;
    topUsers.reserve(std::distance(userCount.begin(), middle));
    for (auto it = userCount.begin(); it != middle; ++it) {
        topUsers.push_back(std::move(it->first));
    }
    return topUsers;
}
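
Since the question targets C++11, here is a sketch of the same approach without structured bindings (C++17) or generic lambdas (C++14). The name GetTopUsersCpp11 and the std::size_t parameter type are choices made for this illustration only. Like the code above, it keeps users with zero matches in the candidate list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// C++11 sketch; GetTopUsersCpp11 is a made-up name for this illustration.
std::vector<std::string> GetTopUsersCpp11(std::vector<std::string> users,
                                          const std::vector<std::string>& lines,
                                          std::size_t topUsersNum) {
    typedef std::pair<std::string, std::size_t> UserCount;

    std::vector<UserCount> userCount;
    userCount.reserve(users.size());
    for (auto& user : users) {
        userCount.emplace_back(std::move(user), 0);  // users was taken by value, so moving is fine
    }

    for (auto& entry : userCount) {
        // Build the regex once per user, not once per (user, line) pair.
        const std::regex rgx("\\b" + entry.first + "\\b", std::regex::icase);
        for (const auto& line : lines) {
            if (std::regex_search(line, rgx)) {
                ++entry.second;  // at most one increment per line
            }
        }
    }

    // Only the first topCount elements need to end up in sorted order.
    const std::size_t topCount = std::min(topUsersNum, userCount.size());
    const auto middle = userCount.begin() + static_cast<std::ptrdiff_t>(topCount);
    std::partial_sort(userCount.begin(), middle, userCount.end(),
        [](const UserCount& lhs, const UserCount& rhs) {
            // Higher count first; equal counts are ordered by name.
            return std::tie(rhs.second, lhs.first) < std::tie(lhs.second, rhs.first);
        });

    std::vector<std::string> topUsers;
    topUsers.reserve(topCount);
    for (auto it = userCount.begin(); it != middle; ++it) {
        topUsers.push_back(std::move(it->first));
    }
    return topUsers;
}
```

With the users and lines from the question and topUsersNum = 4, this returns qwe, atest, john.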

Answer 2

I'm no professional coder, but I've made your code a bit faster (~90% faster, unless my math is wrong or I timed it wrong).

It goes through each of the lines, and for each line it counts the number of occurrences of each given user. If the number of occurrences for the current user is at least as large as that of the previous user, it moves the user to the beginning of the vector.

#include <iostream>
#include <cstdio>
#include <vector>
#include <string>
#include <regex>
#include <algorithm>
#include <chrono>

std::vector<std::string> GetTopUsers(const std::vector<std::string>& users,
    const std::vector<std::string>& lines, const int topUsersNum) {
    std::vector<std::pair<std::string, int>> userOccurancies;

    //count user occurancies
    for (const auto & user : users) {
        int count = 0;
        for (const auto &line : lines) {
            std::regex rgx("\\b" + user + "\\b", std::regex::icase);
            std::smatch match;
            if (std::regex_search(line, match, rgx)) {
                ++count;
                auto userIter = std::find_if(userOccurancies.begin(), userOccurancies.end(),
                    [&user](const std::pair<std::string, int>& element) { return element.first == user; });
                if (userIter == userOccurancies.end()) {
                    userOccurancies.push_back(std::make_pair(user, count));
                }
                else {
                    userIter->second = count;
                }
            }
        }
    }

    //sort by amount of occurancies, if occurancies are equal - sort alphabetically
    std::sort(userOccurancies.begin(), userOccurancies.end(),
        [](const std::pair<std::string, int>& p1, const std::pair<std::string, int>& p2)
    { return (p1.second > p2.second) ? true : (p1.second == p2.second ? p1.first < p2.first : false); });

    //extract top N users
    int topUsersSz = (topUsersNum <= userOccurancies.size() ? topUsersNum : userOccurancies.size());
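    std::vector<std::string> topUsers;
    topUsers.reserve(topUsersSz); // reserve only: constructing with a size would leave topUsersSz empty strings in front of the pushed names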
    for (int i = 0; i < topUsersSz; i++) {
        topUsers.push_back(userOccurancies[i].first);
    }

    return topUsers;
}

unsigned int count_user_occurences(
    const std::string & line,
    const std::string & user
)
{
    unsigned int occur                  = {};
    std::string::size_type curr_index   = {};

    // while we can find the name of the user in the line, and we have not reached the end of the line
    // (std::string::find is case-sensitive and also matches inside longer words)
    while((curr_index = line.find(user, curr_index)) != std::string::npos)
    {
        // increase the number of occurrences
        ++occur;
        // increase string index to skip the current user
        curr_index += user.length();
    }

    // return the number of occurrences
    return occur;
}

std::vector<std::string> get_top_users(
    std::vector<std::string> & user_list,
    std::vector<std::string> & line_list
)
{
    // create vector to hold results
    std::vector<std::string> top_users = {};

    // put all of the users inside the "top_users" vector
    top_users = user_list;

    // make sure none of the vectors are empty
    if(false == user_list.empty()
        && false == line_list.empty())
    {
        // go through each one of the lines
        for(unsigned int i = {}; i < line_list.size(); ++i)
        {
            // holds the number of occurrences for the previous user
            unsigned int last_user_occur = {};

            // go through each one of the users (we copied the list into "top_users")
            for(unsigned int j = {}; j < top_users.size(); ++j)
            {
                // get the number of the current user in the current line
                unsigned int curr_user_occur = count_user_occurences(line_list.at(i), top_users.at(j));
                // user temporary name holder
                std::string temp_user = {};

                // if the number of occurrences of the current user is at least as large as that of the previous user, move it to the top
                if(curr_user_occur >= last_user_occur)
                {
                    // save the current user's name
                    temp_user = top_users.at(j);

                    // erase the user from its current position
                    top_users.erase(top_users.begin() + j);

                    // move the user at the beginning of the vector
                    top_users.insert(top_users.begin(), temp_user);
                }

                // save the occurrences of the current user to compare further users
                last_user_occur = curr_user_occur;
            }
        }
    }

    // return the top user vector
    return top_users;
}

int main()
{
    std::vector<std::string> users = { "john", "atest", "qwe" };
    std::vector<std::string> lines = { "atest john", "Qwe", "qwel", "qwe," };

    // time the first function
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<std::string> top_users = get_top_users(users, lines);   
    auto stop = std::chrono::high_resolution_clock::now();
    // save the time in nanoseconds
    double time = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();

    // print time
    printf("%.05f nanoseconds\n", time);

    // time the second function
    auto start2 = std::chrono::high_resolution_clock::now();    
    std::vector<std::string> top_users2 = GetTopUsers(users, lines, 4);
    auto stop2 = std::chrono::high_resolution_clock::now();
    // save the time in nanoseconds
    double time2 = std::chrono::duration_cast<std::chrono::nanoseconds>(stop2 - start2).count();

    // print time
    printf("%.05f nanoseconds", time2);

    getchar();

    return 0;
}

results (for my PC at least, they're pretty consistent across multiple runs):

366800.00000 nanoseconds
4235900.00000 nanoseconds
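
One thing to keep in mind when replacing the regex with std::string::find: find is case-sensitive and also matches inside longer words, so it does not behave like the "\\b" + user + "\\b" pattern with std::regex::icase from the question. Below is a rough sketch of a regex-free check that keeps both properties. The helper names are made up, and it assumes ASCII text and user names consisting only of letters, digits and '_'.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helpers, not part of the original answer.
// Assumption: ASCII input, user names contain only letters, digits and '_'.
inline bool IsWordChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

inline bool EqualIgnoreCase(char a, char b) {
    return std::tolower(static_cast<unsigned char>(a)) ==
           std::tolower(static_cast<unsigned char>(b));
}

// Returns true if user occurs in line as a whole word, ignoring case.
bool ContainsWholeWord(const std::string& line, const std::string& user) {
    if (user.empty()) {
        return false;
    }
    auto from = line.begin();
    while (true) {
        // Case-insensitive substring search starting at from.
        const auto hit = std::search(from, line.end(), user.begin(), user.end(), EqualIgnoreCase);
        if (hit == line.end()) {
            return false;  // no further candidates
        }
        const auto after = hit + static_cast<std::ptrdiff_t>(user.size());
        const bool startsWord = (hit == line.begin()) || !IsWordChar(*(hit - 1));
        const bool endsWord = (after == line.end()) || !IsWordChar(*after);
        if (startsWord && endsWord) {
            return true;   // one match per line is enough for the count
        }
        from = hit + 1;    // candidate was part of a longer word, keep looking
    }
}
```

A counting loop would call ContainsWholeWord(line, user) once per (user, line) pair and increment the user's count when it returns true. No regex is compiled, and the loop stops at the first whole-word match in a line.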
