Find most common pair of characters in a string

I have written the following function:

//O(n^2)
void MostCommonPair(char * cArr , char * ch1 , char * ch2 , int * amount)
{
    int count , max = 0;
    char cCurrent , cCurrent2;
    int i = 0 , j;
    while(*(cArr + i + 1) != '\0')
    {
        cCurrent = *(cArr + i);
        cCurrent2 = *(cArr + i + 1);
        for(j = i , count = 0 ; *(cArr + j + 1) != '\0' ; j++)
        {
            if(cCurrent ==  *(cArr + j) && cCurrent2 ==  *(cArr + j + 1))
            {
                count++;
            }
        }
        if(count > max)
        {
            *ch1 = cCurrent;
            *ch2 = cCurrent2;
            max = *amount = count;
        }
        i++;
    }
}

For the following input

"xdshahaalohalobscxbsbsbs"

the output is

ch1 = b ch2 = s amount = 4

However, in my opinion the function is very inefficient. Is there a way to go through the string only once, or otherwise reduce the running time to O(n)?

Since a char can hold up to 256 values, you can set up a two-dimensional table of [256*256] counters and run through your string once, incrementing the counter that corresponds to each pair of adjacent characters in the string. Then you can go through the table of 256x256 numbers, pick the largest count, and know which pair it belongs to by looking at its position in the 2D array. Since the size of the counter table is fixed to a constant value independent of the length of the string, that operation is O(1), even though it requires two nested loops.

int count[256][256];
memset(count, 0, sizeof(count));
const char *str = "xdshahaalohalobscxbsbsbs";
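for (const char *p = str ; *(p+1) ; p++) {
    /* index through unsigned char: a plain char may be signed, and a
       negative value would index outside the table */
    count[(unsigned char)*p][(unsigned char)*(p+1)]++;
}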
int bestA = 0, bestB = 0;
for (int i = 0 ; i != 256 ; i++) {
    for (int j = 0 ; j != 256 ; j++) {
        if (count[i][j] > count[bestA][bestB]) {
            bestA = i;
            bestB = j;
        }
    }
}
printf("'%c%c' : %d times\n", bestA, bestB, count[bestA][bestB]);

Here is a link to a demo on ideone.

Keep in mind that although this is the fastest possible solution asymptotically (i.e. it's O(N), and you cannot make it faster than O(N)), the performance is not going to be good for shorter strings. In fact, your solution will beat it hands-down on inputs shorter than approximately 256 characters, probably even more. There are a number of optimizations that you can apply to this code, but I decided against adding them, to keep the main idea of the code clearly visible in its purest and simplest form.
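One possible optimization, shown here only as a sketch and not taken from the answer itself, is to track the running maximum while counting, so the second pass over the 256x256 table disappears. The function name `most_common_pair` and its signature are assumptions made for illustration. The table still has to be cleared on every call, so the fixed cost for short strings remains.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original answer): counts adjacent pairs
   and tracks the running maximum in the same pass.
   Returns the best count, or 0 if the string has fewer than two characters. */
int most_common_pair(const char *s, char *ch1, char *ch2)
{
    /* static keeps the table (about 256 KiB with 4-byte int) off the stack */
    static int count[256][256];
    int best = 0;

    assert(s != NULL && ch1 != NULL && ch2 != NULL);
    memset(count, 0, sizeof count);

    if (s[0] == '\0')
        return 0;

    /* walk the string as unsigned char so every index is in 0..255 */
    for (const unsigned char *p = (const unsigned char *)s; p[1] != '\0'; p++) {
        int c = ++count[p[0]][p[1]];
        if (c > best) {          /* strict '>' keeps the pair that got there first */
            best = c;
            *ch1 = (char)p[0];
            *ch2 = (char)p[1];
        }
    }
    return best;
}
```

Calling `most_common_pair(str, &a, &b)` on the string from the question returns 4 and stores 'b' and 's' in `a` and `b`.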

If you want O(n) runtime, you can use a hashtable (for example, Java's HashMap):

  • Iterate through your string exactly once, 1 character at a time. O(n)
  • For each character visited, look ahead by exactly 1 more character (this is your character pair; just concatenate them). O(1)
  • For each such character pair found, first look for it in the hashtable: O(1)
    • If it's not in the hashtable yet, add it with the character pair as the key and int 1 as the value (this counts the number of times you've seen it in the string). O(1)
    • If it's already in the hashtable, increment its value. O(1)
  • After you are done looking through the string, check the hashtable for the pair with the highest count. O(m) (where m is the number of distinct pairs found; m <= n necessarily)
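Since the question is in C, which has no built-in HashMap, the steps above might be sketched with a small hand-rolled open-addressing table. Everything named here (`struct pair_slot`, `most_common_pair_hashed`, the multiplicative hash constant, the capacity rule) is an assumption made for illustration, and the O(1) cost per lookup holds only in the expected case.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One slot of an illustrative open-addressing hash table (hypothetical). */
struct pair_slot {
    uint32_t key;   /* (first << 8) | second */
    int count;      /* 0 marks an empty slot */
};

/* Returns the count of the most common pair, 0 if the string has no pair,
   or -1 if memory allocation fails. */
int most_common_pair_hashed(const char *s, char *ch1, char *ch2)
{
    assert(s != NULL && ch1 != NULL && ch2 != NULL);
    size_t n = strlen(s);
    if (n < 2)
        return 0;

    /* Power-of-two capacity with load factor <= 0.5. There are at most
       65536 distinct pairs, so 131072 slots are always enough. */
    size_t cap = 16;
    while (cap < 2 * (n - 1) && cap < 131072)
        cap *= 2;

    struct pair_slot *table = calloc(cap, sizeof *table);
    if (table == NULL)
        return -1;

    /* Single pass: look up each pair, insert it or increment its count. */
    for (size_t i = 0; i + 1 < n; i++) {
        uint32_t key = ((uint32_t)(unsigned char)s[i] << 8) | (unsigned char)s[i + 1];
        size_t h = (size_t)((key * UINT32_C(2654435761)) >> 15) & (cap - 1);
        while (table[h].count != 0 && table[h].key != key)
            h = (h + 1) & (cap - 1);              /* linear probing */
        table[h].key = key;
        table[h].count++;
    }

    /* Final scan of the table for the pair with the highest count. */
    int best = 0;
    for (size_t h = 0; h < cap; h++) {
        if (table[h].count > best) {
            best = table[h].count;
            *ch1 = (char)(table[h].key >> 8);
            *ch2 = (char)(table[h].key & 0xFF);
        }
    }
    free(table);
    return best;
}
```

With only 65536 possible keys, a direct 256x256 array is usually simpler in C; the hashtable pays off mainly when the alphabet is much larger than the input, or when the pairs are not single bytes.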

Yes, you can do this in approximately linear time by keeping a running count.

Does that help?

Assuming that by "most common pair" you mean the most common sequence of two adjacent characters, at the pseudo-code level you want to:

 Read the first character into the "second character" register
 while(there is data)
    store the old second character as the new first character
    read the next character as the second one
    increment the count associated with this pair
 Select the most common pair

So what you need is an efficient algorithm for storing the counts associated with character pairs and for finding the most common one.
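The pseudocode above could be written in C roughly as follows, reading a stream one character at a time. This is a sketch under assumptions: the function name `most_common_pair_stream` is hypothetical, the input is assumed to arrive through a `FILE *`, and a fixed 256x256 counter table is used as the storage (which assumes 8-bit bytes).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical function following the pseudocode: reads `in` one character
   at a time and returns the count of the most common pair (0 if none). */
int most_common_pair_stream(FILE *in, char *ch1, char *ch2)
{
    static int count[256][256];   /* static keeps the table off the stack */
    int first, second, best = 0;

    assert(in != NULL && ch1 != NULL && ch2 != NULL);
    memset(count, 0, sizeof count);

    second = fgetc(in);           /* first character goes into the "second" register */
    if (second == EOF)
        return 0;

    for (;;) {
        first = second;           /* old second character becomes the new first */
        second = fgetc(in);       /* read the next character as the second one */
        if (second == EOF)
            break;
        count[first][second]++;   /* fgetc yields unsigned char values, so 0..255 */
    }

    /* select the most common pair */
    for (int i = 0; i < 256; i++) {
        for (int j = 0; j < 256; j++) {
            if (count[i][j] > best) {
                best = count[i][j];
                *ch1 = (char)i;
                *ch2 = (char)j;
            }
        }
    }
    return best;
}
```

Because only two characters are held at any time, this form works on input that is too large to keep in memory, for example `most_common_pair_stream(stdin, &a, &b)`.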
