Most efficient way to search String array for substring and return multiple values?

Say you're given an array containing:

Jordan

Jord

Anna

Rob

RobJord

And you want to return an array of all values that contain Jord (i.e. Jord, Jordan, RobJord). What is the most efficient way to do this?

I'm using Java, but I'm not allowed to use the java.util array functions.

This approach comes to mind:

// Needs: import java.util.ArrayList;
public ArrayList<String> search(String searchString, String[] names)
{
  ArrayList<String> searchList = new ArrayList<String>();

  for (String name : names)
  {
    if(name.contains(searchString))
    {
      searchList.add(name);
    }
  }

  return searchList;
}

Now to search, use this:

String[] names = {"Jordan", "Jord", "Anna", "Rob", "RobJord"};
String searchString = "Jord";

ArrayList<String> filterList = search(searchString, names);

It doesn't use java.util.Arrays methods, it gets the job done cleanly, and it's fast: a single pass over the array.

Now if you can't even use ArrayList, then you have two choices:
1. Write your own implementation of ArrayList and use that.
2. Use the following method, which counts the matches first and then fills an array of exactly that size:

public String[] search(String searchString, String[] names)
{
  int size = getSize(searchString, names);
  String[] searchList = new String[size];

  int index = 0;
  for (String name : names)
  {
    if(name.contains(searchString))
    {
      searchList[index++] = name;
    }
  }

  return searchList;
}

// Returns appropriate size for the Search List
private int getSize(String searchString, String[] names)
{
  int size = 0;
  for (String name : names)
  {
    if(name.contains(searchString))
    {
      size++;
    }
  }

  return size;
}

Well, since this sounds like homework, it's for you to solve, but I would consider this pseudo-code, written in plain English. It avoids the use of java.util.* (e.g. the ArrayList or Arrays classes) and only uses primitive constructs.

count = 0
for each item in the input
    if the rule matches
       increase count by 1

create output array of size count

target index = 0
for each item in the input
    if the rule matches
        add the item to the output array at the target index,
        and increase the target index by 1

return the output array

This code is O(n) in complexity, even though it loops through the input (n) twice, because that's a constant factor: O(2*n) is 2*O(n), which is O(n).

Now, the constant factor could be slightly reduced: instead of only counting on the first pass, also compact the matching values on the first pass, and then copy only the compacted values (of which there are at most n) to a new, smaller array. It would still be O(n), but it may have a slightly lower wall-clock time... or it might perform worse, depending on subtle cache/JIT/data factors. Oh, the fun intricacies of modern computers!
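A minimal sketch of that compact-then-copy variant is below. The class name CompactSearch is made up for illustration, and the scratch buffer sized to the full input is an assumption about how "compacting" would be done; nothing from java.util is used.

```java
// Hypothetical class name; a sketch of the "compact first, then copy" idea.
class CompactSearch {

    // Returns every element of names that contains searchString, in input order.
    static String[] search(String searchString, String[] names) {
        // Scratch buffer as large as the input, so it can hold every possible match.
        String[] buffer = new String[names.length];
        int count = 0;

        // First (and only full) pass: test each name once and compact matches to the front.
        for (String name : names) {
            if (name.contains(searchString)) {
                buffer[count++] = name;
            }
        }

        // Second, shorter pass: copy only the matches into an exactly sized array.
        String[] result = new String[count];
        for (int i = 0; i < count; i++) {
            result[i] = buffer[i];
        }
        return result;
    }

    public static void main(String[] args) {
        String[] names = {"Jordan", "Jord", "Anna", "Rob", "RobJord"};
        for (String match : search("Jord", names)) {
            System.out.println(match);
        }
    }
}
```

The trade-off is that contains runs once per name instead of twice, at the cost of a temporary array as large as the input.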

There is no trivial way to improve the O(n) "efficiency" bound, and especially not for a single run.

It would take a bit of code to get everything set up, and it would be horrible style, but you could convert your Strings into char arrays and keep an int array holding the ASCII values of the letters in "Jord", so that you compare primitives rather than going through object references. Pass the chars you're checking into a conditional block that compares them against the int values of

'J', 'o', 'r', 'd' //74, 111, 114, 100

Again, I only suggest this craziness because you put so much emphasis on efficiency. Right off the bat I'll say there's an efficiency drawback: the time it takes to convert everything to chars. The benefit would be best seen in large processing tasks, such as checking for Jord in an entire 1000-page eBook, because the initialization only happens once (or in large chunks, I suppose, with huge data, but still beneficial either way).

// assuming it's case sensitive: ASCII values for 'J' 'o' 'r' 'd'
int[] charArr = new int[]{74, 111, 114, 100};

Again, it requires some setup, which hinders performance, plus it's just weird, but it does give you the benefit of comparing primitive ints.
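To make the idea concrete, here is one possible sketch of matching a char array against an int array of character codes. The class and method names (CharCodeSearch, containsCodes) are invented for this example, and it is a plain left-to-right scan; whether it actually beats String.contains would have to be measured.

```java
// Hypothetical names; a sketch of comparing chars against int character codes.
class CharCodeSearch {

    // True if text contains the consecutive character codes in pattern (case sensitive).
    static boolean containsCodes(char[] text, int[] pattern) {
        if (pattern.length == 0) {
            return true; // an empty pattern matches anywhere
        }
        // Try every start position where the whole pattern still fits.
        for (int start = 0; start + pattern.length <= text.length; start++) {
            int i = 0;
            // char is promoted to int, so this is a primitive comparison.
            while (i < pattern.length && text[start + i] == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] jord = {74, 111, 114, 100}; // 'J', 'o', 'r', 'd'
        String[] names = {"Jordan", "Jord", "Anna", "Rob", "RobJord"};
        for (String name : names) {
            // toCharArray() is the one-time conversion cost mentioned above.
            if (containsCodes(name.toCharArray(), jord)) {
                System.out.println(name);
            }
        }
    }
}
```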

Another thought would be to consider the statistics of which letters tend to follow which. For example, the likelihood of "J" being followed by a vowel is extremely high, so the chance of "J" being followed by "o" while the word still isn't "Jord" is also high, since we only have 5 vowels (plus y, that weird one...). You might get "Jork", for example, and you've wasted time checking "o" and "r". So with that being said, perhaps it would be better to move the scanner ahead a few letters (or your current array index counter, whichever way you're iterating) and check for "d" once you've established a match for "J". I think that would increase the efficiency.

Basically, I'm saying that if you check letter by letter while iterating, step 1 would be to match "J", and step 2 would be to skip over "o" and check for "r" or "d" instead. In other words, find a candidate, then eliminate candidates aggressively.

EDIT: I'd actually say check for "d" in step 2, and don't check "r" until step 3, once step 2 checks out, because that way your code will be simpler: start at the start, move to the end, then iterate backwards to start+1. If you check for "r" in step 2, then steps 3 and 4 will zigzag between indices.
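One way the start, end, then backwards order could look in code is sketched below. EndsFirstSearch and its contains method are hypothetical names, and this is only an illustration of the check order described in the EDIT, not a measured optimization.

```java
// Hypothetical names; a sketch of checking the first char, then the last, then the middle backwards.
class EndsFirstSearch {

    // True if text contains pattern (case sensitive).
    static boolean contains(String text, String pattern) {
        int m = pattern.length();
        if (m == 0) {
            return true; // an empty pattern matches anywhere
        }
        char first = pattern.charAt(0);
        char last = pattern.charAt(m - 1);

        for (int start = 0; start + m <= text.length(); start++) {
            // Step 1: the first character must match ("J").
            if (text.charAt(start) != first) {
                continue;
            }
            // Step 2: jump to the end of the candidate and check the last character ("d").
            if (text.charAt(start + m - 1) != last) {
                continue; // e.g. "Jork" is rejected here without looking at "o" or "r"
            }
            // Step 3: walk backwards from the second-to-last character to start+1.
            int i = m - 2;
            while (i >= 1 && text.charAt(start + i) == pattern.charAt(i)) {
                i--;
            }
            if (i < 1) {
                return true; // every middle character matched
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] names = {"Jordan", "Jord", "Anna", "Rob", "RobJord", "Jork"};
        for (String name : names) {
            if (contains(name, "Jord")) {
                System.out.println(name);
            }
        }
    }
}
```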
