Efficiently checking for substrings and replacing them - can I improve performance here?

I need to examine millions of strings for abbreviations and replace them with the full version. Because of the data, only abbreviations terminated by a comma should be replaced. Strings can contain multiple abbreviations.

I have a lookup table of about 600 Abbreviation->Fullversion pairs.

My current setup looks something like this. On startup I create a list of ShortForm instances from a CSV file using Jackson and hold them in a singleton:

public static class ShortForm{
    public String fullword;
    public String abbreviation;
}

List<ShortForm> shortForms = new ArrayList<ShortForm>();
// CSV loading code omitted

And some code that uses the list:

for (ShortForm f: shortForms){
    if (address.contains(f.abbreviation+","))
        address = address.replace(f.abbreviation+",", f.fullword+",");
}

Now this works, but it's slow. Is there a way I can speed it up? The first step is to load the ShortForm objects with the commas already in place, but what else could I do?

====== UPDATE ======

I changed the code to work the other way around: it now splits each string into words and checks a map to see whether each word is an abbreviation.

    StringBuilder fullFormed = new StringBuilder();
    for (String s: Splitter.on(" ").split(add)){
        if (shortFormMap.containsKey(s))
            fullFormed.append(shortFormMap.get(s));
        else
            fullFormed.append(s);
        fullFormed.append(" ");
    }

    return fullFormed.toString().trim();

Testing shows this to be over 13x faster than the original approach. Cheers davecom!

It would be a little faster if you skipped the contains() part :)
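A minimal sketch of that suggestion, assuming the commas have already been appended to both strings at load time. The class name DirectReplace and the String[] pair layout are made up for illustration and are not from the question. It is worth measuring both variants, since the relative cost of contains() and replace() differs between JDK versions.

```java
import java.util.List;

// Hypothetical helper, for illustration only.
final class DirectReplace {
    // Each pair is {abbreviation + ",", fullword + ","}, prepared once at startup.
    static String expand(String address, List<String[]> pairs) {
        for (String[] pair : pairs) {
            // replace() leaves the text as it is when the target does not occur,
            // so no separate contains() check is needed for correctness.
            address = address.replace(pair[0], pair[1]);
        }
        return address;
    }
}
```

This still visits all 600 pairs for every address, so it only trims a constant factor.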

What could really improve performance would be to use a better data structure than a simple array for storing your ShortForms. All of the shortForms could be stored sorted alphabetically by abbreviation. You could therefore reduce the lookup time from O(N) to something closer to the O(log N) of a binary search.
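As a rough sketch of the sorted-storage idea, the abbreviations could be kept in a sorted array and looked up with Arrays.binarySearch. The class SortedShortForms and the String[][] input layout are assumptions for illustration. Note that this only pays off when you look up one candidate word at a time, rather than scanning each address for every abbreviation.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical class, for illustration only.
final class SortedShortForms {
    private final String[] abbreviations; // sorted alphabetically
    private final String[] fullWords;     // fullWords[i] belongs to abbreviations[i]

    // pairs[i] = {abbreviation, fullword}; the caller's array is not reordered.
    SortedShortForms(String[][] pairs) {
        String[][] sorted = pairs.clone();
        Arrays.sort(sorted, Comparator.comparing((String[] p) -> p[0]));
        abbreviations = new String[sorted.length];
        fullWords = new String[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            abbreviations[i] = sorted[i][0];
            fullWords[i] = sorted[i][1];
        }
    }

    // Returns the full word, or null if the abbreviation is unknown.
    String lookup(String abbreviation) {
        int i = Arrays.binarySearch(abbreviations, abbreviation);
        return i >= 0 ? fullWords[i] : null;
    }
}
```

With about 600 entries, each lookup needs roughly ten comparisons instead of up to 600.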

I haven't used it before, but perhaps the standard library's SortedMap fits the bill, instead of using a custom object at all: http://docs.oracle.com/javase/7/docs/api/java/util/SortedMap.html

Here's what I'm thinking:

  • Put abbreviation/full word pairs into a TreeMap
  • Tokenize the address into words
  • Check each word to see if it is a key in the TreeMap
  • Replace it if it is
  • Put the corrected tokens back together as an address
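The steps above could be sketched roughly as follows, using only the standard library. The class name TokenExpander is invented for illustration, and the sketch assumes that words are separated by single spaces and that abbreviations themselves contain no spaces. It also applies the question's rule that only abbreviations followed by a comma are replaced.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical class, for illustration only.
final class TokenExpander {
    private final Map<String, String> fullWordByAbbreviation = new TreeMap<>();

    void add(String abbreviation, String fullWord) {
        fullWordByAbbreviation.put(abbreviation, fullWord);
    }

    String expand(String address) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : address.split(" ")) {
            if (token.endsWith(",")) {
                // Strip the comma, look the word up, and put the comma back.
                String word = token.substring(0, token.length() - 1);
                String full = fullWordByAbbreviation.get(word);
                out.add(full != null ? full + "," : token);
            } else {
                // Tokens without a trailing comma are never replaced.
                out.add(token);
            }
        }
        return out.toString();
    }
}
```

Because whole tokens are compared, an abbreviation such as "St" no longer matches inside a longer word like "West,", which the substring-based contains()/replace() loop would rewrite.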

I think I'd do this with a HashMap. The key would be the abbreviation and the value would be the full term. Then just search through a string for a comma and see if the text that precedes the comma is in the dictionary. You could probably map all the replacements in a single string in one pass and then make all the replacements after that.

This makes each lookup O(1), for a total of O(n) lookups where n is the number of abbreviations found, and I don't think there's likely to be a more efficient method.
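One way this comma-driven scan might look, as a sketch rather than a definitive implementation. The class name CommaScanExpander is hypothetical, and the sketch assumes a candidate word starts right after the previous space or comma (or at the start of the string). It folds the "find" and "replace" steps into a single pass with a StringBuilder.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class, for illustration only.
final class CommaScanExpander {
    private final Map<String, String> fullWordByAbbreviation = new HashMap<>();

    void add(String abbreviation, String fullWord) {
        fullWordByAbbreviation.put(abbreviation, fullWord);
    }

    String expand(String address) {
        StringBuilder out = new StringBuilder(address.length() + 16);
        int copiedUpTo = 0; // everything before this index is already in out
        int comma = address.indexOf(',');
        while (comma >= 0) {
            // Walk back from the comma to the start of the preceding word.
            int wordStart = comma;
            while (wordStart > copiedUpTo
                    && address.charAt(wordStart - 1) != ' '
                    && address.charAt(wordStart - 1) != ',') {
                wordStart--;
            }
            String full = fullWordByAbbreviation.get(address.substring(wordStart, comma));
            if (full != null) {
                // Copy the untouched text, then the full word instead of the abbreviation.
                out.append(address, copiedUpTo, wordStart).append(full);
                copiedUpTo = comma; // the comma itself is copied with the next chunk
            }
            comma = address.indexOf(',', comma + 1);
        }
        return out.append(address, copiedUpTo, address.length()).toString();
    }
}
```

The work per address is proportional to its length plus one hash lookup per comma, independent of how many abbreviations are in the table.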
