
Fastest way to look up a String value

I have a simple application that reads data in small strings from large text files and saves them to a database. To save each such String, the application calls the following method many (maybe thousands, or more) times:

public void setValue(String value)
{
    if (!ignore(value))
    {
        // Save the value in the database
    }
}

Currently, I implement the ignore() method by comparing the value against a set of Strings one after another, e.g.

public boolean ignore(String value)
{
    if (value.equalsIgnoreCase("Value 1") || value.equalsIgnoreCase("Value 2"))
    {
        return true;
    }

    return false;
}

However, because I need to check against many such "ignorable" values, which will be defined in another part of the code, I need to use a data structure for this check, instead of multiple consecutive if statements.

So, my question is: what would be the fastest data structure from standard Java to implement this? A HashMap? A Set? Something else?

Initialization time is not an issue, since it will happen statically and once per application invocation.

EDIT: The solutions suggested thus far (including HashSet) appear slower than just using a String[] with all the ignored words and just running "equalsIgnoreCase" against each of these.

Use a HashSet that stores the values in lowercase, and call its contains() method. HashSet has better lookup performance than TreeSet: expected constant time versus logarithmic time for contains().

private static final Set<String> ignored = new HashSet<String>();
static {
    ignored.add("value 1"); // store in lowercase
    ignored.add("value 2"); // store in lowercase
}

public boolean ignore(String value) {
    // Lowercase the input so it matches the lowercase keys stored above
    return ignored.contains(value.toLowerCase());
}

Storing the values in lowercase and lowercasing the input before the lookup avoids having to handle case during comparison, so you get the full speed of the HashSet implementation and have no comparison-related code to write (e.g. Collator, Comparator).

EDITED
Thanks to Jon Skeet for pointing out that certain Turkish characters behave unexpectedly with toLowerCase(): under Turkish casing rules, for example, an uppercase "I" lowercases to a dotless "ı" rather than "i". If you do not intend to support Turkish input (or other languages with unusual casing rules), this approach will work well for you.
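The locale pitfall can be seen in a small sketch. The class name LocaleSafeIgnore and the sample values are illustrative assumptions, not part of the original answer. Passing Locale.ROOT to toLowerCase() is one common way to make the lookup key independent of the JVM's default locale, assuming the ignored values themselves are plain ASCII.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class LocaleSafeIgnore {
    // Hypothetical ignored values, stored in lowercase
    private static final Set<String> IGNORED =
            new HashSet<>(Arrays.asList("title", "value 1"));

    public static boolean ignore(String value) {
        // Locale.ROOT gives locale-neutral lowercasing
        return IGNORED.contains(value.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Under Turkish rules 'I' becomes dotless 'ı' (U+0131), so the key no longer matches
        System.out.println("TITLE".toLowerCase(turkish).equals("title")); // false
        // The locale-neutral lookup still finds the value
        System.out.println(ignore("TITLE")); // true
    }
}
```

If the application really does need to handle Turkish text correctly, a Collator-based comparison is probably a better fit than lowercasing.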

In most cases I'd normally start with a HashSet<String> - but as you want case-insensitivity, that makes it slightly harder.

You could try a TreeSet<Object> with an appropriate Collator for case-insensitivity. For example:

Collator collator = Collator.getInstance(Locale.US);
collator.setStrength(Collator.SECONDARY);

TreeSet<Object> set = new TreeSet<Object>(collator);

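Note that the set is declared as TreeSet<Object> because Collator implements Comparator<Object>. Since the TreeSet constructor accepts a Comparator<? super E>, a TreeSet<String> built with the same collator should compile as well.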
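As a rough, self-contained sketch of how the collator-backed set could be wired into ignore(): the class name CollatorIgnore and the sample values are assumptions made for illustration. With SECONDARY strength, strings that differ only in case compare as equal, so contains() behaves case-insensitively.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;
import java.util.TreeSet;

public class CollatorIgnore {
    private static final TreeSet<Object> IGNORED = buildSet();

    private static TreeSet<Object> buildSet() {
        Collator collator = Collator.getInstance(Locale.US);
        // SECONDARY strength ignores case differences but still distinguishes accents
        collator.setStrength(Collator.SECONDARY);
        TreeSet<Object> set = new TreeSet<Object>(collator);
        // Hypothetical ignored values; original casing can be kept
        set.addAll(Arrays.asList("Value 1", "Value 2"));
        return set;
    }

    public static boolean ignore(String value) {
        // The lookup uses the collator, not String.equals()
        return IGNORED.contains(value);
    }

    public static void main(String[] args) {
        System.out.println(ignore("VALUE 1")); // true
        System.out.println(ignore("Value 3")); // false
    }
}
```

Each lookup costs O(log n) collator comparisons, so this trades some speed for locale-aware matching.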

EDIT: While the above version works with just strings, it may be faster to create a TreeSet<CollationKey>:

Collator collator = Collator.getInstance(Locale.US);
collator.setStrength(Collator.SECONDARY);

TreeSet<CollationKey> set = new TreeSet<CollationKey>();
for (String value : valuesToIgnore) {
    set.add(collator.getCollationKey(value));
}

Then:

public boolean ignore(String value)
{
    return set.contains(collator.getCollationKey(value));
}

It would be nice to have a way of storing the collation keys for all ignored values but then avoid creating new collation keys when testing, but I don't know of a way of doing that.

If you are using Java 7 or later, which supports switch on Strings, this is a fast way to do it:

public boolean ignore(String value) {
  switch (value.toLowerCase()) { // see Jon Skeet's comment
    case "lowercased_ignore_value1":
    case "lowercased_ignore_value2":
      // etc
      return true;
    default:
      return false;
  }
}

Add the words to ignore to a list and just check whether the word is in that list.

That makes it dynamic, since values can be added at run time.
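A minimal sketch of the list-based idea, assuming case-insensitive matching is still wanted. The class ListIgnore and the method addIgnored are hypothetical names introduced for illustration. Note that List.contains() is a linear scan, so the cost grows with the number of ignored values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ListIgnore {
    // Ignored values, normalized to lowercase when they are added
    private final List<String> ignored = new ArrayList<>();

    public void addIgnored(String value) {
        ignored.add(value.toLowerCase(Locale.ROOT));
    }

    public boolean ignore(String value) {
        // Linear scan using equals() on the lowercased input
        return ignored.contains(value.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        ListIgnore filter = new ListIgnore();
        filter.addIgnored("Value 1");
        System.out.println(filter.ignore("VALUE 1")); // true
        System.out.println(filter.ignore("Value 2")); // false
    }
}
```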

It seems that String[] is slightly better (performance-wise) than the other methods proposed, so I will use that.

It is simply something like this:

public boolean ignore(String value)
{
    for (String ignore : IGNORED_VALUES)
    {
        if (ignore.equalsIgnoreCase(value))
        {
            return true;
        }
    }

    // No ignored value matched
    return false;
}

The IGNORED_VALUES object is just a String[] with all ignored values in there.


 