How to reduce time complexity in C++ with nested loops and regex?

I have the following function.

Input arguments: a vector of user names, a vector of strings, and the number of top users to return.

First I count the number of occurrences of each user in the strings. If a user occurs several times in one string, it still counts as 1.

Then I sort the users by number of occurrences. If the counts are equal, the user names are sorted alphabetically.

The function returns the top N users with the most occurrences.

std::vector<std::string> GetTopUsers(const std::vector<std::string>& users,
    const std::vector<std::string>& lines, const int topUsersNum) {
    std::vector<std::pair<std::string, int>> userOccurancies;

    //count user occurancies
    for (const auto & user : users) {
        int count = 0;
        for (const auto &line : lines) {
            std::regex rgx("\\b" + user + "\\b", std::regex::icase);
            std::smatch match;
            if (std::regex_search(line, match, rgx)) {
                ++count;
                auto userIter = std::find_if(userOccurancies.begin(), userOccurancies.end(),
                    [&user](const std::pair<std::string, int>& element) { return element.first == user; });
                if (userIter == userOccurancies.end()) {
                    userOccurancies.push_back(std::make_pair(user, count));
                }
                else {
                    userIter->second = count;
                }
            }
        }
    }

    //sort by amount of occurancies, if occurancies are equal - sort alphabetically
    std::sort(userOccurancies.begin(), userOccurancies.end(),
        [](const std::pair<std::string, int>& p1, const std::pair<std::string, int>& p2)
    { return (p1.second > p2.second) ? true : (p1.second == p2.second ? p1.first < p2.first : false); });

    //extract top N users
    int topUsersSz = (topUsersNum <= userOccurancies.size() ? topUsersNum : userOccurancies.size());
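    std::vector<std::string> topUsers;
    topUsers.reserve(topUsersSz); // reserve only: sizing the vector here would leave topUsersSz empty strings in front of the pushed names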
    for (int i = 0; i < topUsersSz; i++) {
        topUsers.push_back(userOccurancies[i].first);
    }

    return topUsers;
}

So for the input

    std::vector<std::string> users = { "john", "atest", "qwe" };
    std::vector<std::string> lines = { "atest john", "Qwe", "qwe1", "qwe," };

    int topUsersNum = 4;

the output will be qwe atest john
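To see where that result comes from, here is a minimal sketch of the counting rule on its own. `CountLinesMentioning` is a hypothetical helper name, not part of the original code, and it assumes user names contain no regex metacharacters.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): number of lines that
// mention `user` as a whole word, ignoring case. A line counts at most once,
// however many times the name appears in it.
int CountLinesMentioning(const std::string& user,
                         const std::vector<std::string>& lines) {
    // \b is a word boundary: "qwe" matches in "qwe," but not inside "qwe1".
    // Assumption: `user` contains no regex metacharacters.
    const std::regex rgx("\\b" + user + "\\b", std::regex::icase);
    int count = 0;
    for (const auto& line : lines) {
        if (std::regex_search(line, rgx)) {
            ++count;
        }
    }
    return count;
}
```

For the sample input this gives 2 for qwe ("Qwe" and "qwe,"), 1 for atest and 1 for john. The tie between atest and john is broken alphabetically, which yields qwe atest john.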

But it looks very complex: O(n^2) for the nested loops, plus a regex inside. It must be O(n^3) or even more.

Can you please give me advice on how to reduce the complexity in C++11?

I would also appreciate advice about the code itself.

Or is there perhaps a better board for questions about complexity and performance?

Thank you.

Update:

   std::vector<std::string> GetTopUsers2(const std::vector<std::string>& users,
    const std::vector<std::string>& lines, const size_t topUsersNum) {
    std::vector<std::pair<std::string, int>> userOccurancies(users.size());

    auto userOcIt = userOccurancies.begin();
    for (const auto & user : users) {
        userOcIt->first = user; // user is a const reference, so std::move here would still copy
        userOcIt->second = 0;
        userOcIt++;
    }

    //count user occurancies
    for (auto &user: userOccurancies) {
        int count = 0;
        std::regex rgx("\\b" + user.first + "\\b", std::regex::icase);
        std::smatch match;
        for (const auto &line : lines) {
            if (std::regex_search(line, match, rgx)) {
                ++count;
                user.second = count;
            }
        }
    }

    //sort by amount of occurancies, if occurancies are equal - sort alphabetically
    std::sort(userOccurancies.begin(), userOccurancies.end(),
        [](const std::pair<std::string, int>& p1, const std::pair<std::string, int>& p2)
    { return (p1.second > p2.second) ? true : (p1.second == p2.second ? p1.first < p2.first : false); });

    //extract top N users
    auto middle = userOccurancies.begin() + std::min(topUsersNum, userOccurancies.size());
    int topUsersSz = (topUsersNum <= userOccurancies.size() ? topUsersNum : userOccurancies.size());
    std::vector<std::string> topUsers(topUsersSz);
    auto topIter = topUsers.begin();
    for (auto iter = userOccurancies.begin(); iter != middle; iter++) {
        *topIter = std::move(iter->first);
        topIter++;
    }

    return topUsers;
}

Thanks to @Jarod42, I updated the first part. I think that allocating memory for the vector once in the constructor is faster than calling emplace_back every time, so I used that. If I am wrong, please correct me.
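On the allocation point, a minimal sketch: reserve followed by emplace_back also allocates only once, so the two approaches differ mainly in whether the elements are default-constructed first. `BuildsWithoutReallocation` is a hypothetical name used only for illustration, and whether either variant is measurably faster here would need profiling.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: build (name, 0) pairs with reserve + emplace_back and
// report whether the vector kept its capacity (i.e. never reallocated).
bool BuildsWithoutReallocation(const std::vector<std::string>& users) {
    std::vector<std::pair<std::string, int>> counts;
    counts.reserve(users.size());                 // one allocation up front
    const std::size_t capacityBefore = counts.capacity();
    for (const auto& user : users) {
        counts.emplace_back(user, 0);             // constructed in place, no default-construct + assign
    }
    return counts.capacity() == capacityBefore;   // unchanged capacity means no reallocation
}
```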

I am also using C++11, not C++17.

Timing results:

Old: 3539400.00000 nanoseconds
New: 2674000.00000 nanoseconds

It is better, but it still looks complex, doesn't it?

Constructing a regex is costly, and it can be moved out of the inner loop.

You could also move the strings instead of copying them.

You don't need to sort the whole range; std::partial_sort is enough.

More importantly, you can avoid the inner find_if.

std::vector<std::string>
GetTopUsers(
    std::vector<std::string> users,
    const std::vector<std::string>& lines,
    int topUsersNum)
{
    std::vector<std::pair<std::string, std::size_t>> userCount;
    userCount.reserve(users.size());

    for (auto& user : users) {
        userCount.emplace_back(std::move(user), 0);
    }

    // Note: structured bindings require C++17.
    for (auto& [user, count] : userCount) {
        std::regex rgx("\\b" + user + "\\b", std::regex::icase);
        for (const auto &line : lines) {
            std::smatch match;
            if (std::regex_search(line, match, rgx)) {
                ++count;
            }
        }
    }

    //sort by amount of occurancies, if occurancies are equal - sort alphabetically
    auto middle = userCount.begin() + std::min<std::size_t>(topUsersNum, userCount.size());
    std::partial_sort(userCount.begin(),
                      middle,
                      userCount.end(),
                      [](const auto& lhs, const auto& rhs)
        {
            return std::tie(rhs.second, lhs.first) < std::tie(lhs.second, rhs.first);
        });

    //extract top N users
    std::vector<std::string> topUsers;
    topUsers.reserve(std::distance(userCount.begin(), middle));
    for (auto it = userCount.begin(); it != middle; ++it) {
        topUsers.push_back(std::move(it->first));
    }
    return topUsers;
}
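Since the question asks for C++11, and the function above relies on structured bindings (C++17) and a generic lambda (C++14), here is a sketch of the same approach restricted to C++11. The name `GetTopUsersCpp11`, the `UserCount` typedef and the `std::size_t` parameter are assumptions made for illustration, and user names are assumed to contain no regex metacharacters.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical C++11 variant of the function above (the name is illustrative).
std::vector<std::string> GetTopUsersCpp11(std::vector<std::string> users,
                                          const std::vector<std::string>& lines,
                                          std::size_t topUsersNum) {
    typedef std::pair<std::string, std::size_t> UserCount;
    std::vector<UserCount> userCount;
    userCount.reserve(users.size());
    for (auto& user : users) {
        userCount.emplace_back(std::move(user), 0);
    }

    for (auto& entry : userCount) {
        // The regex is built once per user, not once per (user, line) pair.
        const std::regex rgx("\\b" + entry.first + "\\b", std::regex::icase);
        for (const auto& line : lines) {
            if (std::regex_search(line, rgx)) {
                ++entry.second;
            }
        }
    }

    // Only the first topUsersNum positions need to be in sorted order.
    const auto middle = userCount.begin() + std::min(topUsersNum, userCount.size());
    std::partial_sort(userCount.begin(), middle, userCount.end(),
        [](const UserCount& lhs, const UserCount& rhs) {
            // Higher count first; equal counts fall back to ascending name.
            return std::tie(rhs.second, lhs.first) < std::tie(lhs.second, rhs.first);
        });

    std::vector<std::string> topUsers;
    topUsers.reserve(std::distance(userCount.begin(), middle));
    for (auto it = userCount.begin(); it != middle; ++it) {
        topUsers.push_back(std::move(it->first));
    }
    return topUsers;
}
```

Like the function above, this keeps users with zero matches in the candidate list, whereas the code in the question only ever adds users that matched at least once. If that matters, such entries can be filtered out before the partial sort.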

I'm no professional coder, but I've made your code a bit faster (~90% faster, unless my math is wrong or I timed it wrong).

It goes through each of the lines, and for each line it counts the number of occurrences of each given user. If the number of occurrences for the current user is larger than for the previous one, it moves the user to the beginning of the vector.

#include <iostream>
#include <cstdio>
#include <vector>
#include <string>
#include <regex>
#include <algorithm>
#include <chrono>

std::vector<std::string> GetTopUsers(const std::vector<std::string>& users,
    const std::vector<std::string>& lines, const int topUsersNum) {
    std::vector<std::pair<std::string, int>> userOccurancies;

    //count user occurancies
    for (const auto & user : users) {
        int count = 0;
        for (const auto &line : lines) {
            std::regex rgx("\\b" + user + "\\b", std::regex::icase);
            std::smatch match;
            if (std::regex_search(line, match, rgx)) {
                ++count;
                auto userIter = std::find_if(userOccurancies.begin(), userOccurancies.end(),
                    [&user](const std::pair<std::string, int>& element) { return element.first == user; });
                if (userIter == userOccurancies.end()) {
                    userOccurancies.push_back(std::make_pair(user, count));
                }
                else {
                    userIter->second = count;
                }
            }
        }
    }

    //sort by amount of occurancies, if occurancies are equal - sort alphabetically
    std::sort(userOccurancies.begin(), userOccurancies.end(),
        [](const std::pair<std::string, int>& p1, const std::pair<std::string, int>& p2)
    { return (p1.second > p2.second) ? true : (p1.second == p2.second ? p1.first < p2.first : false); });

    //extract top N users
    int topUsersSz = (topUsersNum <= userOccurancies.size() ? topUsersNum : userOccurancies.size());
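    std::vector<std::string> topUsers;
    topUsers.reserve(topUsersSz); // reserve only: sizing the vector here would leave topUsersSz empty strings in front of the pushed names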
    for (int i = 0; i < topUsersSz; i++) {
        topUsers.push_back(userOccurancies[i].first);
    }

    return topUsers;
}

unsigned int count_user_occurences(
    std::string & line,
    std::string & user
)
{
    unsigned int occur                  = {};
    std::string::size_type curr_index   = {};

    // while we can find the name of the user in the line, and we have not reached the end of the line
    while((curr_index = line.find(user, curr_index)) != std::string::npos)
    {
        // increase the number of occurences
        ++occur;
        // increase string index to skip the current user
        curr_index += user.length();
    }

    // return the number of occurences
    return occur;
}

std::vector<std::string> get_top_users(
    std::vector<std::string> & user_list,
    std::vector<std::string> & line_list
)
{
    // create vector to hold results
    std::vector<std::string> top_users = {};

    // put all of the users inside the "top_users" vector
    top_users = user_list;

    // make sure none of the vectors are empty
    if(false == user_list.empty()
        && false == line_list.empty())
    {
        // go trough each one of the lines
        for(unsigned int i = {}; i < line_list.size(); ++i)
        {
            // holds the number of occurences for the previous user
            unsigned int last_user_occur = {};

            // go trough each one of the users (we copied the list into "top_users")
            for(unsigned int j = {}; j < top_users.size(); ++j)
            {
                // get the number of the current user in the current line
                unsigned int curr_user_occur = count_user_occurences(line_list.at(i), top_users.at(j));
                // user temporary name holder
                std::string temp_user = {};

                // if the number of occurences of the current user is larger than the one of the previous user, move it at the top
                if(curr_user_occur >= last_user_occur)
                {
                    // save the current user's name
                    temp_user = top_users.at(j);

                    // erase the user from its current position
                    top_users.erase(top_users.begin() + j);

                    // move the user at the beginning of the vector
                    top_users.insert(top_users.begin(), temp_user);
                }

                // save the occurences of the current user to compare further users
                last_user_occur = curr_user_occur;
            }
        }
    }

    // return the top user vector
    return top_users;
}

int main()
{
    std::vector<std::string> users = { "john", "atest", "qwe" };
    std::vector<std::string> lines = { "atest john", "Qwe", "qwel", "qwe," };

    // time the first function
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<std::string> top_users = get_top_users(users, lines);   
    auto stop = std::chrono::high_resolution_clock::now();
    // save the time in nanoseconds
    double time = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();

    // print time
    printf("%.05f nanoseconds\n", time);

    // time the second function
    auto start2 = std::chrono::high_resolution_clock::now();    
    std::vector<std::string> top_users2 = GetTopUsers(users, lines, 4);
    auto stop2 = std::chrono::high_resolution_clock::now();
    // save the time in nanoseconds
    double time2 = std::chrono::duration_cast<std::chrono::nanoseconds>(stop2 - start2).count();

    // print time
    printf("%.05f nanoseconds", time2);

    getchar();

    return 0;
}

Results (on my PC at least; they're pretty consistent across multiple runs):

366800.00000 nanoseconds
4235900.00000 nanoseconds
