
JavaScript string tokenizer regex with placeholders

I have a tokenizer function that takes a string, a regex pattern to split on, and an arbitrary list of regex patterns to be protected from tokenization. To achieve that, I replace each protected match with a placeholder of the form ____SSS____<index>____SSS____ so that those patterns do not get split:

function tokenize(str, default_pattern, protected_patterns) {
    // One alternation of all protected patterns, each wrapped in a non-capturing group.
    const screen = new RegExp('(?:' + protected_patterns.map(s => '(?:' + s + ')').join('|') + ')', 'gi');
    const screened = [];
    // Swap every protected match for an indexed placeholder.
    str = str.replace(screen, s => {
        const i = screened.push(s) - 1;
        return '____SSS____' + i + '____SSS____'; // chose a non-separator as screener, so that these placeholders don't get split.
    });
    // Split, then restore the protected text; the g flag covers tokens holding more than one placeholder.
    return str.split(default_pattern).map(s => s.replace(/____SSS____(\d+)____SSS____/g, (_, i) => screened[i]));
}

For example, to prevent the pattern yo-ho from being split, I call:

tokenize("Podia ser yo-ho, mi amor ahora ya acabó", /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/i, ["\\byo-ho\\b"])
(8) ["Podia", "ser", "yo-ho", "mi", "amor", "ahora", "ya", "acabó"]

Of course, I have to add the placeholder format ____SSS____(\d+)____SSS____ to the regex; otherwise the placeholder itself gets split:

tokenize("Podia ser yo-ho, mi amor ahora ya acabó", /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/i, ["\\byo-ho\\b"])
(9) ["Podia", "ser", "SSS", "SSS", "mi", "amor", "ahora", "ya", "acabó"]
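One detail about the working pattern is worth spelling out: inside a character class the placeholder text is not matched as a sequence. It only adds the individual characters `_`, `S`, `(`, `)`, `+` and the digit class `\d` to the set of non-separators. The following is a small sketch illustrating this; the sample string and the variable names are made up for the illustration and are not from the original post.

```javascript
// The pattern from the question, with the placeholder text inside the class.
const withPlaceholderText = /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/i;

// An explicit spelling of the same class: underscore, digits, parentheses and plus.
// (S is already covered by A-Z.)
const explicit = /[^a-zA-Zá-úÁ-ÚñÑüÜ_\d()+]+/i;

// Hypothetical sample containing one placeholder plus parentheses and digits.
const sample = "Podia ser ____SSS____0____SSS____, mi (amor) 2+2";

const a = sample.split(withPlaceholderText);
const b = sample.split(explicit);

console.log(a);
console.log(JSON.stringify(a) === JSON.stringify(b));
```

A side effect is that digits, parentheses and plus signs in ordinary text also stop acting as separators, which may or may not be acceptable for a given language rule.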

Now, for different languages I may have different split rules like

{
    "es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
    "fr" : /[^a-z0-9äâàéèëêïîöôùüûœç]+/i
}

and I would like to dynamically add ____SSS____(\d+)____SSS____ to each of them, but I cannot find the right way to do this. The result should look like:

 {
      "es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/,
      "fr" : /[^a-z0-9äâàéèëêïîöôùüûœç____SSS____(\d+)____SSS____]+/i
 }

which would make the tokenizer with protected patterns work properly.

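You can simply capture the source of the existing split rule like this:
(.+)(\].*)
The first group captures everything up to the last closing bracket and the second captures the bracket and whatever follows it. Insert your placeholder between the first and second capture groups.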

https://regex101.com/r/QCFnLS/1
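One way this capture could be applied to the rule table is sketched below. It reads each rule's `source`, splits it at the last `]` with the `(.+)(\].*)` pattern, and rebuilds the regex with its original `flags`. The names `withPlaceholder`, `PLACEHOLDER_CLASS`, `rules` and `protectedRules` are hypothetical, and the sketch assumes every rule contains a single character class.

```javascript
// Hypothetical constant: the text to add inside each rule's character class.
const PLACEHOLDER_CLASS = '____SSS____(\\d+)____SSS____';

// Hypothetical helper: returns a new RegExp with the placeholder text
// inserted just before the last "]" of the original pattern.
function withPlaceholder(pattern, placeholder = PLACEHOLDER_CLASS) {
  // A replacer function is used so "$" sequences in the placeholder are never interpreted.
  const source = pattern.source.replace(
    /(.+)(\].*)/,
    (_, head, tail) => head + placeholder + tail
  );
  return new RegExp(source, pattern.flags); // keep flags such as "i"
}

const rules = {
  es: /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
  fr: /[^a-z0-9äâàéèëêïîöôùüûœç]+/i,
};

// Build the protected variant of every language rule.
const protectedRules = Object.fromEntries(
  Object.entries(rules).map(([lang, re]) => [lang, withPlaceholder(re)])
);

console.log(protectedRules.es);
console.log("Podia ser ____SSS____0____SSS____, mi amor".split(protectedRules.es));
```

Because the greedy `(.+)` backtracks to the last `]`, a rule with several character classes would only have its final class extended; such rules would need a different approach.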

