Efficiently checking for substrings and replacing them - can I improve performance here?

I need to examine millions of strings for abbreviations and replace them with the full version. Due to the data, only abbreviations terminated by a comma should be replaced. Strings can contain multiple abbreviations.

I have a lookup table of Abbreviation->Fullversion pairs; it contains about 600 pairs.

My current setup looks something like this. On startup I create a list of ShortForm instances from a CSV file using Jackson and hold them in a singleton:

public static class ShortForm{
    public String fullword;
    public String abbreviation;
}

List<ShortForm> shortForms = new ArrayList<ShortForm>();
// CSV-loading code omitted

And some code that uses the list:

for (ShortForm f: shortForms){
    if (address.contains(f.abbreviation+","))
        address = address.replace(f.abbreviation+",", f.fullword+",");
}

Now this works, but it's slow. Is there a way I can speed it up? The first step is to load the ShortForm objects with commas in place, but what else could I do?

UPDATE: Changed the code to work the other way around. It now splits each string into words and checks a map to see whether each word is an abbreviation.

    StringBuilder fullFormed = new StringBuilder();
    for (String s: Splitter.on(" ").split(add)){
        if (shortFormMap.containsKey(s))
            fullFormed.append(shortFormMap.get(s));
        else
            fullFormed.append(s);
        fullFormed.append(" ");
    }

    return fullFormed.toString().trim();

Testing shows this to be over 13x faster than the original approach. Cheers davecom!

It would be a bit faster if you skipped the contains() part :)
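To illustrate skipping contains(): String.replace() already leaves the text as it was when there is no match, so the extra check only adds a second scan. Here is a minimal sketch of the original loop without it, also building the comma-terminated strings once at load time as the question suggests. The class name and the two `...WithComma` fields are made up for this example and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

class SkipContainsSketch {
    // Mirrors the ShortForm class from the question, with the comma
    // appended once up front (hypothetical field names).
    static class ShortForm {
        final String abbreviationWithComma;
        final String fullwordWithComma;

        ShortForm(String abbreviation, String fullword) {
            this.abbreviationWithComma = abbreviation + ",";
            this.fullwordWithComma = fullword + ",";
        }
    }

    static String expand(String address, List<ShortForm> shortForms) {
        for (ShortForm f : shortForms) {
            // No contains() check: replace() yields the same text when
            // the abbreviation does not occur.
            address = address.replace(f.abbreviationWithComma, f.fullwordWithComma);
        }
        return address;
    }

    public static void main(String[] args) {
        List<ShortForm> shortForms = new ArrayList<>();
        shortForms.add(new ShortForm("St", "Street"));
        shortForms.add(new ShortForm("Rd", "Road"));
        System.out.println(expand("12 Main St, Springfield Rd, USA", shortForms));
    }
}
```

This still scans every address once per abbreviation, so it trims the constant factor rather than changing how the work grows with the size of the lookup table.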

What could really improve performance would be to use a better data structure than a plain list for storing your ShortForms. All of the ShortForms could be stored sorted alphabetically by abbreviation. Looking up a single abbreviation would then take O(log N) comparisons, as in a binary search, instead of an O(N) scan of the whole list.

I haven't used it before, but perhaps the standard library's SortedMap fits the bill instead of using a custom object at all: http://docs.oracle.com/javase/7/docs/api/java/util/SortedMap.html

Here's what I'm thinking:

  • Put abbreviation/full word pairs into TreeMap
  • Tokenize the address into words.
  • Check each word to see if it is a key in the TreeMap
  • Replace it if it is
  • Put the corrected tokens back together as an address
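The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: tokens are separated by single spaces, the map keys are stored without the trailing comma, and only tokens that end in a comma are looked up (per the question's requirement). The class and method names are invented for the example.

```java
import java.util.Map;
import java.util.TreeMap;

class TreeMapExpanderSketch {
    static String expand(String address, Map<String, String> fullByAbbreviation) {
        StringBuilder out = new StringBuilder();
        boolean first = true;
        // Tokenize the address into words (assumes single-space separators).
        for (String token : address.split(" ")) {
            if (!first) {
                out.append(' ');
            }
            first = false;
            if (token.endsWith(",")) {
                // Only comma-terminated tokens are candidates for replacement.
                String word = token.substring(0, token.length() - 1);
                String full = fullByAbbreviation.get(word);
                out.append(full != null ? full : word).append(',');
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // TreeMap keeps keys sorted, so each get() is an O(log N) lookup.
        Map<String, String> fullByAbbreviation = new TreeMap<>();
        fullByAbbreviation.put("St", "Street");
        fullByAbbreviation.put("Rd", "Road");
        System.out.println(expand("12 Main St, Springfield Rd, USA", fullByAbbreviation));
    }
}
```

Because expand() only depends on the Map interface, swapping the TreeMap for a HashMap needs no other changes.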

I think I'd do this with a HashMap. The key would be the abbreviation and the value would be the full term. Then just search through a string for a comma and see if the text that precedes the comma is in the dictionary. You could probably map all the replacements in a single string in one pass and then make all the replacements after that.

This makes each lookup O(1) on average, for a total of O(n) lookups per string, where n is the number of commas (candidate abbreviations) in it, and I don't think there's likely a more efficient method.
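A rough sketch of that comma-driven scan, under the assumptions that a word starts after the previous space or comma and that the map keys carry no trailing comma. The names here are hypothetical, and the replacements are written into a StringBuilder during the same pass rather than collected first.

```java
import java.util.HashMap;
import java.util.Map;

class CommaScanSketch {
    static String expand(String address, Map<String, String> fullByAbbreviation) {
        StringBuilder out = new StringBuilder(address.length() + 16);
        int copied = 0;      // everything before this index is already in out
        int prevComma = -1;  // index of the previously seen comma
        int comma = address.indexOf(',');
        while (comma >= 0) {
            // The candidate word starts after the last space or the last comma.
            int wordStart = Math.max(address.lastIndexOf(' ', comma - 1) + 1, prevComma + 1);
            String full = fullByAbbreviation.get(address.substring(wordStart, comma));
            if (full != null) {
                // Copy the untouched text, then the full form; the comma
                // itself is copied later along with the following text.
                out.append(address, copied, wordStart).append(full);
                copied = comma;
            }
            prevComma = comma;
            comma = address.indexOf(',', comma + 1);
        }
        out.append(address, copied, address.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fullByAbbreviation = new HashMap<>();
        fullByAbbreviation.put("St", "Street");
        fullByAbbreviation.put("Rd", "Road");
        System.out.println(expand("12 Main St, Springfield Rd, USA", fullByAbbreviation));
    }
}
```

Unlike the split-based approach, this leaves spacing exactly as it was in the input, since text between replacements is copied verbatim.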
