Implementing an efficient algorithm to find the intersection of two strings

Implement an algorithm that takes two strings as input and returns the intersection of the two, with each letter represented at most once.

Algorithm (assuming the language used is C#):

  1. Convert both strings into char arrays.
  2. Take the smaller array and build a hash table from it, with each character as the key and 0 as the value.
  3. Loop through the other array and increment the count in the hash table for each character that is present in it.
  4. Take out all characters from the hash table whose value is > 0.
  5. These are the intersection values.

This is an O(n) solution, but it uses extra space: two char arrays and a hash table.
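The five steps above can be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical, a `Dictionary<char, int>` stands in for the hash table, and the char-array conversion of step 1 is skipped because a C# string can be enumerated directly.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HashTableIntersection
{
    // Hypothetical helper name; not part of the original question.
    public static string Intersect(string a, string b)
    {
        // Step 2: build the table from the shorter string, every count starting at 0.
        string shorter = a.Length <= b.Length ? a : b;
        string longer = a.Length <= b.Length ? b : a;

        var counts = new Dictionary<char, int>();
        foreach (char c in shorter)
        {
            counts[c] = 0;
        }

        // Step 3: walk the other string and bump the count of every tabled character.
        foreach (char c in longer)
        {
            if (counts.ContainsKey(c))
            {
                counts[c]++;
            }
        }

        // Steps 4-5: characters with a count above 0 occur in both strings.
        var result = new StringBuilder();
        foreach (KeyValuePair<char, int> entry in counts)
        {
            if (entry.Value > 0)
            {
                result.Append(entry.Key);
            }
        }
        return result.ToString();
    }
}
```

Calling `HashTableIntersection.Intersect("aabbccccddd", "aabc")` returns the letters a, b and c, each once; the order of the letters is not guaranteed, since it depends on the dictionary's enumeration order.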

Can you think of a better solution than this?

How about this ...

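// Requires: using System.Linq;
var s1 = "aabbccccddd";
var s2 = "aabc";

// Intersect yields each character common to both strings once: a, b, c
var ans = s1.Intersect(s2);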

Haven't tested this, but here's my thought:

  1. Sort the characters of both strings, so you have two ordered sequences of characters.
  2. Keeping an index into each sequence, compare the "next" character from each. If they differ, advance the index of the sequence holding the smaller character. If they match, output the character (unless it was the last one output) and advance both indexes.
  3. Continue until you reach the end of either sequence; nothing left over in the other one can be part of the intersection.

Needs only the sorted characters, two integers, and an output string (or StringBuilder). Because C# strings are immutable, the sort works on char-array copies rather than in place. As an added bonus, the output values will be sorted too!

Part 2: This is what I'd write (sorry about the comments, I'm new to Stack Overflow):

// Requires: using System; using System.Text;
private static string intersect(string left, string right)
{
  StringBuilder theResult = new StringBuilder();

  //  Sort copies of both inputs so that equal characters line up.
  char[] sortedLeft = left.ToCharArray();
  char[] sortedRight = right.ToCharArray();
  Array.Sort(sortedLeft);
  Array.Sort(sortedRight);

  int leftIndex = 0;
  int rightIndex = 0;

  //  Once either side is exhausted, no further character can be in both.
  while (leftIndex < sortedLeft.Length && rightIndex < sortedRight.Length)
  {
    char leftChar = sortedLeft[leftIndex];
    char rightChar = sortedRight[rightIndex];

    if (leftChar < rightChar)
    {
      //  The smaller character cannot appear later on the other side.
      leftIndex++;
    }
    else if (leftChar > rightChar)
    {
      rightIndex++;
    }
    else
    {
      //  Present in both strings: append it unless it was just appended.
      if (theResult.Length == 0 || theResult[theResult.Length - 1] != leftChar)
      {
        theResult.Append(leftChar);
      }
      leftIndex++;
      rightIndex++;
    }
  }

  return theResult.ToString();
}

I hope that makes more sense.

You don't need two char arrays. The System.String data type has a built-in indexer by position that returns the char at that position, so you can just loop from 0 to (String.Length - 1). If you're more interested in speed than in optimizing storage space, you could make a HashSet for one of the strings, then make a second HashSet that will contain your final result. Then iterate through the second string, testing each char against the first HashSet, and if it exists, add it to the second HashSet. By the end, you already have a single HashSet with all the intersections, and you save yourself the pass through the Hashtable looking for entries with a non-zero value.
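A minimal sketch of the two-HashSet idea described above, using the string indexer instead of char arrays. The class and method names are hypothetical and only there for illustration.

```csharp
using System.Collections.Generic;

public static class HashSetIntersection
{
    // Hypothetical helper name; a sketch of the approach, not a definitive implementation.
    public static HashSet<char> Intersect(string first, string second)
    {
        // First set: every distinct character of the first string,
        // read through the string indexer rather than a char array.
        var seen = new HashSet<char>();
        for (int i = 0; i < first.Length; i++)
        {
            seen.Add(first[i]);
        }

        // Second set: characters of the second string that the first set contains.
        // Adding to a HashSet ignores duplicates, so each letter is kept once.
        var result = new HashSet<char>();
        for (int i = 0; i < second.Length; i++)
        {
            if (seen.Contains(second[i]))
            {
                result.Add(second[i]);
            }
        }
        return result;
    }
}
```

For the inputs "aabbccccddd" and "aabc", the returned set contains a, b and c. No final filtering pass is needed because only matching characters are ever added to the result set.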

EDIT: I entered this before all the comments on the question about not wanting to use any built-in containers at all.

Here's how I would do this. It's still O(N), and instead of a hash table it uses (ideally) a single int array of length 26.

  1. Make an array of 26 integers, one element for each letter of the alphabet, initialized to 0.
  2. Iterate over the first string, decrementing the corresponding element by one each time a letter is encountered.
  3. Iterate over the second string and, for each letter you encounter, replace the value at the corresponding index with its absolute value. (edit: thanks to scwagner in comments)
  4. Return all letters whose indexes hold a value greater than 0.

Still O(N), with extra space of only 26 ints.
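The four steps above might look like this in C#. The sketch assumes the input consists of lowercase letters 'a' to 'z' only (any other character is simply skipped), and the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class LetterArrayIntersection
{
    // Hypothetical helper name. Assumption: only 'a'..'z' are considered.
    public static string Intersect(string first, string second)
    {
        // Step 1: one counter per letter, all starting at 0.
        int[] counts = new int[26];

        // Step 2: letters of the first string drive their counter negative.
        foreach (char c in first)
        {
            if (c >= 'a' && c <= 'z')
            {
                counts[c - 'a']--;
            }
        }

        // Step 3: a letter of the second string flips a negative counter to positive.
        // Counters that are still 0 (letter absent from the first string) stay 0.
        foreach (char c in second)
        {
            if (c >= 'a' && c <= 'z')
            {
                counts[c - 'a'] = Math.Abs(counts[c - 'a']);
            }
        }

        // Step 4: positive counters mark letters present in both strings.
        var result = new StringBuilder();
        for (int i = 0; i < counts.Length; i++)
        {
            if (counts[i] > 0)
            {
                result.Append((char)('a' + i));
            }
        }
        return result.ToString();
    }
}
```

`LetterArrayIntersection.Intersect("aabbccccddd", "aabc")` returns "abc". Letters that occur only in the first string keep a negative counter and letters that occur only in the second keep 0, so neither passes the final check.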

Of course, if you're not limited to only lowercase or only uppercase characters, your array size may need to change.

"with each letter represented at most once"

I'm assuming that this means you just need to know the intersections, and not how many times they occurred. If that's so, then you can trim down your algorithm by making use of yield. Instead of storing the count and continuing to iterate the second string looking for additional matches, you can yield the intersection right there and continue to the next possible match from the first string.
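One possible reading of the yield idea is sketched below. The answer does not spell out the details, so the use of a `HashSet<char>` for lookups, the loop direction, and the class and method names are all assumptions made for this illustration.

```csharp
using System.Collections.Generic;

public static class LazyIntersection
{
    // Hypothetical helper name; one interpretation of the yield suggestion.
    public static IEnumerable<char> Intersect(string first, string second)
    {
        // Distinct characters of the first string that have not been matched yet.
        var candidates = new HashSet<char>(first);

        foreach (char c in second)
        {
            // Remove returns true only the first time a candidate is found,
            // so each shared character is yielded exactly once and no counts are kept.
            if (candidates.Remove(c))
            {
                yield return c;
            }
        }
    }
}
```

Because the method is an iterator, a caller that only needs the first shared character can stop enumerating early. For example, `string.Concat(LazyIntersection.Intersect("aabbccccddd", "aabc"))` produces "abc".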
