Java: Replacing multiple different substrings in a string at once (or in the most efficient way)

I need to replace many different substrings in a string in the most efficient way. Is there another way, other than the brute-force approach of replacing each field using string.replace?

If the string you are operating on is very long, or you are operating on many strings, then it could be worthwhile to use a java.util.regex.Matcher. (The pattern takes time to compile up front, so this won't be efficient if your input is very small or your search pattern changes frequently.)

Below is a full example, based on a list of tokens taken from a map. (It uses StringUtils from Apache Commons Lang.)

Map<String,String> tokens = new HashMap<String,String>();
tokens.put("cat", "Garfield");
tokens.put("beverage", "coffee");

String template = "%cat% really needs some %beverage%.";

// Create pattern of the format "%(cat|beverage)%"
String patternString = "%(" + StringUtils.join(tokens.keySet(), "|") + ")%";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(template);

StringBuffer sb = new StringBuffer();
while(matcher.find()) {
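    // quoteReplacement keeps '$' and '\' in a value from being read as group references
    matcher.appendReplacement(sb, Matcher.quoteReplacement(tokens.get(matcher.group(1))));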
}
matcher.appendTail(sb);

System.out.println(sb.toString());

Once the regular expression is compiled, scanning the input string is generally very quick (although if your regular expression is complex or involves backtracking, you would still need to benchmark to confirm this).
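As a rough sketch of the same idea on Java 9 or later, the pattern can be compiled once and reused across many templates, with `Matcher.replaceAll(Function)` doing the loop. The class name `TokenReplacer` and its `apply` method are hypothetical names chosen for illustration; only JDK calls are used, so Apache Commons Lang is not needed here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical helper: compiles the "%(a|b|c)%" pattern once and reuses it. */
final class TokenReplacer {
    private final Map<String, String> tokens;
    private final Pattern pattern;

    TokenReplacer(Map<String, String> tokens) {
        this.tokens = new HashMap<>(tokens);
        // Quote each key so regex metacharacters in a token name match literally.
        String alternation = this.tokens.keySet().stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        this.pattern = Pattern.compile("%(" + alternation + ")%");
    }

    String apply(String template) {
        if (tokens.isEmpty()) {
            return template; // nothing to replace
        }
        // quoteReplacement keeps '$' and '\' in a value from being read as group references.
        return pattern.matcher(template)
            .replaceAll(m -> Matcher.quoteReplacement(tokens.get(m.group(1))));
    }
}
```

Calling `new TokenReplacer(tokens).apply(template)` with the map and template from the example above should give the same result as the `appendReplacement` loop; whether it is faster in practice still needs measuring.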

Algorithm

One of the most efficient ways to replace matching strings (without regular expressions) is to use the Aho-Corasick algorithm with a performant Trie (pronounced "try"), a fast hashing algorithm, and an efficient collections implementation.
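To make the algorithm less abstract, here is a minimal, unoptimized sketch of Aho-Corasick replacement using only the JDK. It is not the library implementation discussed in this answer: the class name `MiniAhoCorasick` is hypothetical, it does not enforce whole-word matching, and it resolves overlaps by taking the leftmost match and, among matches starting at the same index, the longest one.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

final class MiniAhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;      // longest proper suffix of this node's path that is also in the trie
        Node output;    // nearest node on the fail chain where a keyword ends (or null)
        String keyword; // non-null when a keyword ends exactly here
    }

    private final Node root = new Node();
    private final Map<String, String> replacements;

    MiniAhoCorasick(Map<String, String> replacements) {
        this.replacements = new HashMap<>(replacements);
        // 1. Build the trie of keywords (empty keys are skipped).
        for (String key : this.replacements.keySet()) {
            if (key.isEmpty()) continue;
            Node node = root;
            for (char c : key.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.keyword = key;
        }
        // 2. Breadth-first pass to fill in failure and output links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> edge : node.next.entrySet()) {
                Node child = edge.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(edge.getKey())) f = f.fail;
                child.fail = f.next.getOrDefault(edge.getKey(), root);
                child.output = (child.fail.keyword != null) ? child.fail : child.fail.output;
                queue.add(child);
            }
        }
    }

    String replace(String text) {
        // best[s] = longest keyword found that starts at index s, or null.
        String[] best = new String[text.length()];
        Node state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.next.containsKey(c)) state = state.fail;
            state = state.next.getOrDefault(c, root);
            // Report every keyword that ends at position i.
            for (Node hit = (state.keyword != null) ? state : state.output;
                 hit != null; hit = hit.output) {
                int start = i - hit.keyword.length() + 1;
                if (best[start] == null || best[start].length() < hit.keyword.length()) {
                    best[start] = hit.keyword;
                }
            }
        }
        // Second pass: copy the text, swapping in replacements left to right.
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            if (best[i] != null) {
                out.append(replacements.get(best[i]));
                i += best[i].length();
            } else {
                out.append(text.charAt(i++));
            }
        }
        return out.toString();
    }
}
```

The text is scanned once regardless of how many keywords there are, which is where the advantage over repeated `replace` calls comes from. A production implementation would typically avoid the boxed `Character` map and the per-character `best` array.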

Simple Code

A simple solution leverages Apache's StringUtils.replaceEach as follows:

  private String testStringUtils(
    final String text, final Map<String, String> definitions ) {
    final String[] keys = keys( definitions );
    final String[] values = values( definitions );

    return StringUtils.replaceEach( text, keys, values );
  }

This slows down on large texts.

Fast Code

Bor's implementation of the Aho-Corasick algorithm introduces a bit more complexity, which becomes an implementation detail when hidden behind a façade with the same method signature:

  private String testBorAhoCorasick(
    final String text, final Map<String, String> definitions ) {
    // Create a buffer sufficiently large that re-allocations are minimized.
    final StringBuilder sb = new StringBuilder( text.length() << 1 );

    final TrieBuilder builder = Trie.builder();
    builder.onlyWholeWords();
    builder.removeOverlaps();

    final String[] keys = keys( definitions );

    for( final String key : keys ) {
      builder.addKeyword( key );
    }

    final Trie trie = builder.build();
    final Collection<Emit> emits = trie.parseText( text );

    int prevIndex = 0;

    for( final Emit emit : emits ) {
      final int matchIndex = emit.getStart();

      sb.append( text.substring( prevIndex, matchIndex ) );
      sb.append( definitions.get( emit.getKeyword() ) );
      prevIndex = emit.getEnd() + 1;
    }

    // Add the remainder of the string (contains no more matches).
    sb.append( text.substring( prevIndex ) );

    return sb.toString();
  }

Benchmarks

For the benchmarks, the buffer was created using randomNumeric as follows:

  private final static int TEXT_SIZE = 1000;
  private final static int MATCHES_DIVISOR = 10;

  private final static StringBuilder SOURCE
    = new StringBuilder( randomNumeric( TEXT_SIZE ) );

Where MATCHES_DIVISOR dictates the number of variables to inject:

  private void injectVariables( final Map<String, String> definitions ) {
    for( int i = (SOURCE.length() / MATCHES_DIVISOR) + 1; i > 0; i-- ) {
      final int r = current().nextInt( 1, SOURCE.length() );
      SOURCE.insert( r, randomKey( definitions ) );
    }
  }

The benchmark code itself (JMH seemed overkill):

long duration = System.nanoTime();
final String result = testBorAhoCorasick( text, definitions );
duration = System.nanoTime() - duration;
System.out.println( elapsed( duration ) );
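The `elapsed` helper is not shown in the answer. The sketch below is one guess at its behaviour, inferred only from the "N seconds, M millis" lines in the results; `MicroTimer` and `timeNanos` are hypothetical names. It also runs a few untimed warm-up calls, since a single cold `System.nanoTime()` measurement includes JIT compilation time.

```java
import java.util.function.Supplier;

final class MicroTimer {
    /** Formats nanoseconds as "S seconds, M millis" (M is the total millisecond count). */
    static String elapsed(long nanos) {
        long millis = nanos / 1_000_000L;
        return (millis / 1000L) + " seconds, " + millis + " millis";
    }

    /** Runs the task a few times untimed (JIT warm-up), then times one run. */
    static <T> long timeNanos(Supplier<T> task, int warmups) {
        for (int i = 0; i < warmups; i++) {
            task.get();
        }
        long start = System.nanoTime();
        T result = task.get();
        long duration = System.nanoTime() - start;
        // Use the result so the timed call cannot be trivially discarded.
        if (result == null) {
            throw new IllegalStateException("task returned null");
        }
        return duration;
    }
}
```

For example, `MicroTimer.elapsed(MicroTimer.timeNanos(() -> testBorAhoCorasick(text, definitions), 5))` would time the method above after five warm-up runs. For anything beyond a rough comparison, JMH remains the more reliable tool.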

1,000,000 : 1,000

A simple micro-benchmark with 1,000,000 characters and 1,000 randomly-placed strings to replace.

  • testStringUtils: 25 seconds, 25533 millis
  • testBorAhoCorasick: 0 seconds, 68 millis

No contest.

10,000 : 1,000

Using 10,000 characters and 1,000 matching strings to replace:

  • testStringUtils: 1 seconds, 1402 millis
  • testBorAhoCorasick: 0 seconds, 37 millis

The divide closes.

1,000 : 10

Using 1,000 characters and 10 matching strings to replace:

  • testStringUtils: 0 seconds, 7 millis
  • testBorAhoCorasick: 0 seconds, 19 millis

For short strings, the overhead of setting up Aho-Corasick eclipses the brute-force approach by StringUtils.replaceEach.

A hybrid approach based on text length is possible, to get the best of both implementations.

Implementations

Consider comparing other implementations for text longer than 1 MB, including:

Papers

Papers and information relating to the algorithm:

This worked for me:

String result = input.replaceAll("string1|string2|string3","replacementString");

Example:

String input = "applemangobananaarefruits";
String result = input.replaceAll("mango|are|ts","-");
System.out.println(result);

Output: apple-banana-frui-

If you are going to be changing a String many times, then it is usually more efficient to use a StringBuilder (but measure your performance to find out):

String str = "The rain in Spain falls mainly on the plain";
StringBuilder sb = new StringBuilder(str);
// do your replacing in sb - although you'll find this trickier than simply using String
String newStr = sb.toString();

Every time you do a replace on a String, a new String object is created, because Strings are immutable. StringBuilder is mutable, that is, it can be changed as much as you want.

StringBuilder will perform replace more efficiently, since its character array buffer can be specified to a required length. StringBuilder is designed for more than appending!
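As a small sketch of what "doing your replacing in sb" could look like, the helper below combines `StringBuilder.indexOf` and `StringBuilder.replace`. The class and method names are hypothetical, and each `replace` call still shifts the tail of the buffer, so the advice to measure still applies.

```java
final class BuilderReplace {
    /** Replaces every occurrence of target in sb, in place, and returns the same builder. */
    static StringBuilder replaceAll(StringBuilder sb, String target, String replacement) {
        if (target.isEmpty()) {
            return sb; // an empty target would otherwise loop forever
        }
        int index = sb.indexOf(target);
        while (index >= 0) {
            sb.replace(index, index + target.length(), replacement);
            // Resume after the inserted text so the replacement is never re-scanned.
            index = sb.indexOf(target, index + replacement.length());
        }
        return sb;
    }
}
```

Several replacements can be applied in sequence to the same builder without creating an intermediate String for each one.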

Of course, the real question is whether this is an optimisation too far. The JVM is very good at handling the creation of multiple objects and the subsequent garbage collection, and as with all optimisation questions, my first question is whether you've measured this and determined that it's a problem.

Check this:

String.format(str,STR[])

For instance:

String.format( "Put your %s where your %s is", "money", "mouth" );
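One detail that may help when using this for templates: format specifiers can address arguments by position with `%1$s`, `%2$s`, and so on, so a single argument can appear more than once. The sketch below uses a hypothetical `FormatDemo.greet` method. Note that this only works when the template is written with format specifiers; it does not replace arbitrary substrings.

```java
import java.util.Locale;

final class FormatDemo {
    static String greet(String title, String name) {
        // %1$s and %2$s pick arguments by position, so name can be used twice.
        return String.format(Locale.ROOT, "Hello %1$s %2$s. Welcome, %2$s!", title, name);
    }
}
```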

Rythm, a Java template engine, has now been released with a new feature called String interpolation mode, which allows you to do something like:

String result = Rythm.render("@name is inviting you", "Diana");

The above case shows you can pass arguments to a template by position. Rythm also allows you to pass arguments by name:

Map<String, Object> args = new HashMap<String, Object>();
args.put("title", "Mr.");
args.put("name", "John");
String result = Rythm.render("Hello @title @name", args);

Note: Rythm is VERY FAST, about 2 to 3 times faster than String.format and Velocity, because it compiles the template into Java byte code; the runtime performance is very close to concatenation with StringBuilder.


The below is based on Todd Owen's answer. That solution has the problem that if the replacements contain characters that have special meaning in regular expressions, you can get unexpected results. I also wanted to be able to optionally do a case-insensitive search. Here is what I came up with:

/**
 * Performs simultaneous search/replace of multiple strings. Case Sensitive!
 */
public static String replaceMultiple(String target, Map<String, String> replacements) {
  return replaceMultiple(target, replacements, true);
}

/**
 * Performs simultaneous search/replace of multiple strings.
 * 
 * @param target        string to perform replacements on.
 * @param replacements  map where key represents value to search for, and value represents replacem
 * @param caseSensitive whether or not the search is case-sensitive.
 * @return replaced string
 */
public static String replaceMultiple(String target, Map<String, String> replacements, boolean caseSensitive) {
  if(target == null || "".equals(target) || replacements == null || replacements.size() == 0)
    return target;

  //if we are doing case-insensitive replacements, we need to make the map case-insensitive--make a new map with all-lower-case keys
  if(!caseSensitive) {
    Map<String, String> altReplacements = new HashMap<String, String>(replacements.size());
    for(String key : replacements.keySet())
      altReplacements.put(key.toLowerCase(), replacements.get(key));

    replacements = altReplacements;
  }

  StringBuilder patternString = new StringBuilder();
  if(!caseSensitive)
    patternString.append("(?i)");

  patternString.append('(');
  boolean first = true;
  for(String key : replacements.keySet()) {
    if(first)
      first = false;
    else
      patternString.append('|');

    patternString.append(Pattern.quote(key));
  }
  patternString.append(')');

  Pattern pattern = Pattern.compile(patternString.toString());
  Matcher matcher = pattern.matcher(target);

  StringBuffer res = new StringBuffer();
  while(matcher.find()) {
    String match = matcher.group(1);
    if(!caseSensitive)
      match = match.toLowerCase();
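    // quoteReplacement keeps '$' and '\' in a replacement from being read as group references
    matcher.appendReplacement(res, Matcher.quoteReplacement(replacements.get(match)));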
  }
  matcher.appendTail(res);

  return res.toString();
}

Here are my unit test cases:

@Test
public void replaceMultipleTest() {
  assertNull(ExtStringUtils.replaceMultiple(null, null));
  assertNull(ExtStringUtils.replaceMultiple(null, Collections.<String, String>emptyMap()));
  assertEquals("", ExtStringUtils.replaceMultiple("", null));
  assertEquals("", ExtStringUtils.replaceMultiple("", Collections.<String, String>emptyMap()));

  assertEquals("folks, we are not sane anymore. with me, i promise you, we will burn in flames", ExtStringUtils.replaceMultiple("folks, we are not winning anymore. with me, i promise you, we will win big league", makeMap("win big league", "burn in flames", "winning", "sane")));

  assertEquals("bcaacbbcaacb", ExtStringUtils.replaceMultiple("abccbaabccba", makeMap("a", "b", "b", "c", "c", "a")));
  assertEquals("bcaCBAbcCCBb", ExtStringUtils.replaceMultiple("abcCBAabCCBa", makeMap("a", "b", "b", "c", "c", "a")));
  assertEquals("bcaacbbcaacb", ExtStringUtils.replaceMultiple("abcCBAabCCBa", makeMap("a", "b", "b", "c", "c", "a"), false));

  assertEquals("c colon  backslash temp backslash  star  dot  star ", ExtStringUtils.replaceMultiple("c:\\temp\\*.*", makeMap(".", " dot ", ":", " colon ", "\\", " backslash ", "*", " star "), false));
}

private Map<String, String> makeMap(String ... vals) {
  Map<String, String> map = new HashMap<String, String>(vals.length / 2);
  for(int i = 1; i < vals.length; i+= 2)
    map.put(vals[i-1], vals[i]);
  return map;
}

How about using the replaceAll() method?

public String replace(String input, Map<String, String> pairs) {
  // Reverse lexic-order of keys is good enough for most cases,
  // as it puts longer words before their prefixes ("tool" before "too").
  // However, there are corner cases, which this algorithm doesn't handle
  // no matter what order of keys you choose, eg. it fails to match "edit"
  // before "bed" in "..bedit.." because "bed" appears first in the input,
  // but "edit" may be the desired longer match. Depends which you prefer.
  final Map<String, String> sorted = 
      new TreeMap<String, String>(Collections.reverseOrder());
  sorted.putAll(pairs);
  final String[] keys = sorted.keySet().toArray(new String[sorted.size()]);
  final String[] vals = sorted.values().toArray(new String[sorted.size()]);
  final int lo = 0, hi = input.length();
  final StringBuilder result = new StringBuilder();
  int s = lo;
  for (int i = s; i < hi; i++) {
    for (int p = 0; p < keys.length; p++) {
      if (input.regionMatches(i, keys[p], 0, keys[p].length())) {
        /* TODO: check for "edit", if this is "bed" in "..bedit.." case,
         * i.e. look ahead for all prioritized/longer keys starting within
         * the current match region; iff found, then ignore match ("bed")
         * and continue search (find "edit" later), else handle match. */
        // if (better-match-overlaps-right-ahead)
        //   continue;
        result.append(input, s, i).append(vals[p]);
        i += keys[p].length();
        s = i--;
        break; // stop testing other keys at this position; resume scanning after the match
      }
    }
  }
  if (s == lo) // no matches? no changes!
    return input;
  return result.append(input, s, hi).toString();
}
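The ordering concern in the comments above also applies to the regex approach, because Java's alternation takes the first alternative that matches at a given position, not the longest. A hedged sketch, assuming non-empty keys and using the hypothetical name `LongestFirstReplacer`, sorts keys by length before building the pattern:

```java
import java.util.Comparator;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class LongestFirstReplacer {
    static String replace(String input, Map<String, String> pairs) {
        if (pairs.isEmpty()) {
            return input;
        }
        // Longer keys first, so "tool" is tried before its prefix "too".
        String alternation = pairs.keySet().stream()
            .sorted(Comparator.comparingInt(String::length).reversed())
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        return Pattern.compile(alternation).matcher(input)
            .replaceAll(m -> Matcher.quoteReplacement(pairs.get(m.group())));
    }
}
```

This handles the prefix case, but like the loop above it still matches left to right: for "bedit" with keys "bed" and "edit", "bed" wins because it starts earlier.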

Summary: A single-class implementation of Dave's answer that automatically chooses the more efficient of the two algorithms.

This is a full, single-class implementation based on the above excellent answer from Dave Jarvis. The class automatically chooses between the two supplied algorithms for maximum efficiency. (This answer is for people who would just like to quickly copy and paste.)

ReplaceStrings class:

package somepackage;

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;
import org.ahocorasick.trie.Trie.TrieBuilder;
import org.apache.commons.lang3.StringUtils;

/**
 * ReplaceStrings, This class is used to replace multiple strings in a section of text, with high
 * time efficiency. The chosen algorithms were adapted from: https://stackoverflow.com/a/40836618
 */
public final class ReplaceStrings {

    /**
     * replace, This replaces multiple strings in a section of text, according to the supplied
     * search and replace definitions. For maximum efficiency, this will automatically choose
     * between two possible replacement algorithms.
     *
     * Performance note: If it is known in advance that the source text is long, then this method
     * signature has a very small additional performance advantage over the other method signature.
     * (Although either method signature will still choose the best algorithm.)
     */
    public static String replace(
        final String sourceText, final Map<String, String> searchReplaceDefinitions) {
        final boolean useLongAlgorithm
            = (sourceText.length() > 1000 || searchReplaceDefinitions.size() > 25);
        if (useLongAlgorithm) {
            // No parameter adaptations are needed for the long algorithm.
            return replaceUsing_AhoCorasickAlgorithm(sourceText, searchReplaceDefinitions);
        } else {
            // Create search and replace arrays, which are needed by the short algorithm.
            final ArrayList<String> searchList = new ArrayList<>();
            final ArrayList<String> replaceList = new ArrayList<>();
            final Set<Map.Entry<String, String>> allEntries = searchReplaceDefinitions.entrySet();
            for (Map.Entry<String, String> entry : allEntries) {
                searchList.add(entry.getKey());
                replaceList.add(entry.getValue());
            }
            return replaceUsing_StringUtilsAlgorithm(sourceText, searchList, replaceList);
        }
    }

    /**
     * replace, This replaces multiple strings in a section of text, according to the supplied
     * search strings and replacement strings. For maximum efficiency, this will automatically
     * choose between two possible replacement algorithms.
     *
     * Performance note: If it is known in advance that the source text is short, then this method
     * signature has a very small additional performance advantage over the other method signature.
     * (Although either method signature will still choose the best algorithm.)
     */
    public static String replace(final String sourceText,
        final ArrayList<String> searchList, final ArrayList<String> replacementList) {
        if (searchList.size() != replacementList.size()) {
            throw new RuntimeException("ReplaceStrings.replace(), "
                + "The search list and the replacement list must be the same size.");
        }
        final boolean useLongAlgorithm = (sourceText.length() > 1000 || searchList.size() > 25);
        if (useLongAlgorithm) {
            // Create a definitions map, which is needed by the long algorithm.
            HashMap<String, String> definitions = new HashMap<>();
            final int searchListLength = searchList.size();
            for (int index = 0; index < searchListLength; ++index) {
                definitions.put(searchList.get(index), replacementList.get(index));
            }
            return replaceUsing_AhoCorasickAlgorithm(sourceText, definitions);
        } else {
            // No parameter adaptations are needed for the short algorithm.
            return replaceUsing_StringUtilsAlgorithm(sourceText, searchList, replacementList);
        }
    }

    /**
     * replaceUsing_StringUtilsAlgorithm, This is a string replacement algorithm that is most
     * efficient for sourceText under 1000 characters, and less than 25 search strings.
     */
    private static String replaceUsing_StringUtilsAlgorithm(final String sourceText,
        final ArrayList<String> searchList, final ArrayList<String> replacementList) {
        final String[] searchArray = searchList.toArray(new String[]{});
        final String[] replacementArray = replacementList.toArray(new String[]{});
        return StringUtils.replaceEach(sourceText, searchArray, replacementArray);
    }

    /**
     * replaceUsing_AhoCorasickAlgorithm, This is a string replacement algorithm that is most
     * efficient for sourceText over 1000 characters, or large lists of search strings.
     */
    private static String replaceUsing_AhoCorasickAlgorithm(final String sourceText,
        final Map<String, String> searchReplaceDefinitions) {
        // Create a buffer sufficiently large that re-allocations are minimized.
        final StringBuilder sb = new StringBuilder(sourceText.length() << 1);
        final TrieBuilder builder = Trie.builder();
        builder.onlyWholeWords();
        builder.ignoreOverlaps();
        for (final String key : searchReplaceDefinitions.keySet()) {
            builder.addKeyword(key);
        }
        final Trie trie = builder.build();
        final Collection<Emit> emits = trie.parseText(sourceText);
        int prevIndex = 0;
        for (final Emit emit : emits) {
            final int matchIndex = emit.getStart();

            sb.append(sourceText.substring(prevIndex, matchIndex));
            sb.append(searchReplaceDefinitions.get(emit.getKeyword()));
            prevIndex = emit.getEnd() + 1;
        }
        // Add the remainder of the string (contains no more matches).
        sb.append(sourceText.substring(prevIndex));
        return sb.toString();
    }

    /**
     * main, This contains some test and example code.
     */
    public static void main(String[] args) {
        String shortSource = "The quick brown fox jumped over something. ";
        StringBuilder longSourceBuilder = new StringBuilder();
        for (int i = 0; i < 50; ++i) {
            longSourceBuilder.append(shortSource);
        }
        String longSource = longSourceBuilder.toString();
        HashMap<String, String> searchReplaceMap = new HashMap<>();
        ArrayList<String> searchList = new ArrayList<>();
        ArrayList<String> replaceList = new ArrayList<>();
        searchReplaceMap.put("fox", "grasshopper");
        searchReplaceMap.put("something", "the mountain");
        searchList.add("fox");
        replaceList.add("grasshopper");
        searchList.add("something");
        replaceList.add("the mountain");
        String shortResultUsingArrays = replace(shortSource, searchList, replaceList);
        String shortResultUsingMap = replace(shortSource, searchReplaceMap);
        String longResultUsingArrays = replace(longSource, searchList, replaceList);
        String longResultUsingMap = replace(longSource, searchReplaceMap);
        System.out.println(shortResultUsingArrays);
        System.out.println("----------------------------------------------");
        System.out.println(shortResultUsingMap);
        System.out.println("----------------------------------------------");
        System.out.println(longResultUsingArrays);
        System.out.println("----------------------------------------------");
        System.out.println(longResultUsingMap);
        System.out.println("----------------------------------------------");
    }
}

Needed Maven dependencies:

(Add these to your pom file if needed.)

    <!-- Apache Commons utilities. Super commonly used utilities.
    https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.10</version>
    </dependency>

    <!-- ahocorasick, An algorithm used for efficient searching and 
    replacing of multiple strings.
    https://mvnrepository.com/artifact/org.ahocorasick/ahocorasick -->
    <dependency>
        <groupId>org.ahocorasick</groupId>
        <artifactId>ahocorasick</artifactId>
        <version>0.4.0</version>
    </dependency>

import java.util.*;
import java.io.*;

public class Main {
    public static void main(String[] args) throws IOException, Exception {

        // this program might help you with your problem
        // if not, I still hope you could get some ideas out of this
        
        Scanner userInput = new Scanner(System.in),
        fileLocation = new Scanner(new File(userInput.nextLine())); // enter your .txt, .java etc.. file local directory

        String search = userInput.nextLine().trim(), // the word or line you want to replace
               replace = userInput.nextLine().trim(); // the word or line replacement

        String newFile = ""; // this will be the template for your edited file

        LinkedList<String> lineOfWords = new LinkedList<String>(); // every line of words or sentences will be stored in here

        for (int index = 0; fileLocation.hasNextLine(); index++) {
            lineOfWords.add(fileLocation.nextLine().replaceAll(search, replace)); // it will edit the line if there was a match before storing it in the list
            newFile = newFile.concat(lineOfWords.get(index)) + "\n"; // this will create an edited file
        }
        FileWriter saveNewFile = new FileWriter(userInput.nextLine()); // enter the local directory where you want your new file to be saved
        saveNewFile.write(newFile); // finally, the saving method
        saveNewFile.close(); // close the writer so the output is flushed to disk
        fileLocation.close(); // close the scanner that reads the input file
        userInput.close(); // close the scanner that reads from System.in
    }
}
