Matching a large number of different sentences (using regexp pattern parsing)

I want to use regexps to build a text sentence classifier (for chatbot natural language processing).

I have a very large number (e.g. >> 100) of different kinds of text sentences to match against regexp patterns.

When a sentence matches a regexp (say, an intent), it activates a specific action (a function handler).

I define specific regexps in advance to match each different set of sentences, for example:

    // I have a long list of regexps (many regexps for many intents)

    const regexps = [
      /from (?<fromCity>.+)/,  // ---> actionOne()
      /to (?<toCity>.+)/,      // ---> actionTwo()
      /.../,                   // ---> anotherAction()
      /.../                    // ---> yetAnotherAction()
    ]

    // I have a long list of actions (function handlers, stored as references, not called here)

    const actions = [
      actionOne,
      actionTwo,
      ...,
      ...
    ]

How can I build the fastest (multi-regexp) classifier in JavaScript?

My current quick and dirty solution is to just check each regexp sequentially:

    // at run time        
    ...
    sentence = 'from Genova'
    ...

    if (sentence.match(/from (?<fromCity>.+)/))
      actionOne()

    else if (sentence.match(/to (?<toCity>.+)/))
      actionTwo()

    else if ...
    else if ...
    else 
      fallback()

The if-else chain above does not scale well and, above all, is slow (even if sorting the regexps by frequency of use could help).

An alternative approach to improve performance could be to create a single (big) regexp: an alternation of named groups, one for each matcher regexp.

As in the minimal example:

    const regexp = /(?<one>from (?<fromCity>.+))|(?<two>to (?<toCity>.+))/

So I create the regexp classifier simply like this (please take the code below as JavaScript pseudo-code):

    // at build time

    // I collect all possible regexps, each one as a named group
    const intents = [
      '(?<one>from (?<fromCity>.+))',
      '(?<two>to (?<toCity>.+))',
      '...',
      '...'
    ]

    const classifier = new RegExp(intents.join('|'))

    // collection of function handlers, one for each regexp
    const Actions = {
      'one': actionOne,
      'two': actionTwo,
      ...,
      ...
    }

    // at run time

    const match = sentence.match(classifier)

    // if there is a match, call the corresponding function handler:
    // match.groups holds every named group, and only the groups of the
    // alternative that matched are defined
    const intent = match && Object.keys(Actions).find(name => match.groups[name] !== undefined)
    const action = Actions[intent]

    if ( action )
      action()
    else
      fallback() // no match
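
A self-contained sketch of the combined-regexp idea, mainly to show how the matching intent can be recovered from `match.groups`. The two patterns are the ones from the example; the handler bodies, the `classify` name, and the return values are assumptions made for illustration.

```javascript
// Each intent: an outer named group wrapping the intent's own pattern.
const intents = [
  '(?<one>from (?<fromCity>.+))',
  '(?<two>to (?<toCity>.+))'
]

const classifier = new RegExp(intents.join('|'))

// Hypothetical handlers, keyed by the outer group name.
const Actions = {
  one: groups => `departure: ${groups.fromCity}`,
  two: groups => `arrival: ${groups.toCity}`
}

const fallback = () => 'no match'

function classify(sentence) {
  const match = sentence.match(classifier)
  if (!match) return fallback()

  // Groups belonging to alternatives that did not match are undefined,
  // so the first defined key of Actions names the matching intent.
  const name = Object.keys(Actions).find(key => match.groups[key] !== undefined)
  return name ? Actions[name](match.groups) : fallback()
}
```

Calling `classify('from Genova')` runs the handler registered under `one`, and a sentence that matches no alternative goes to `fallback()`.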

Does this make sense? Any suggestions for a better approach?

It very likely depends on quite a few things, such as each individual RegExp (e.g. how many capture groups it has), the actual size of the list, and the length of your input.

But when testing on a very large number of RegExps (10000 simple ones), every variation of the big combined RegExp was much slower than just executing the individual ones one by one (JSPerf).

Given that, and the fact that it makes the code simpler overall, I would suggest not going for the big RegExp approach.
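
To repeat the comparison locally, a rough timing sketch along these lines could be used. It is not the original JSPerf test: the generated patterns (`word0 (.+)`, `word1 (.+)`, and so on), the pattern count, and the helper names are assumptions, and the timings will vary with the engine and with the real patterns.

```javascript
// Build N simple patterns of the form /word<i> (.+)/ (made up for this sketch).
const N = 1000
const sources = Array.from({ length: N }, (_, i) => `word${i} (.+)`)

// Approach 1: individual regexps, tested one by one.
const separate = sources.map(src => new RegExp(src))
function matchSequential(sentence) {
  return separate.findIndex(re => re.test(sentence))
}

// Approach 2: one big alternation with one named group per pattern.
const combined = new RegExp(sources.map((src, i) => `(?<i${i}>${src})`).join('|'))
function matchCombined(sentence) {
  const match = combined.exec(sentence)
  if (!match) return -1
  // Only the named group of the alternative that matched is defined.
  const name = Object.keys(match.groups).find(key => match.groups[key] !== undefined)
  return Number(name.slice(1))
}

// Time a function over a number of runs, in milliseconds.
function time(fn, sentence, runs = 200) {
  const start = Date.now()
  for (let r = 0; r < runs; r++) fn(sentence)
  return Date.now() - start
}

const sentence = `word${N - 1} Genova` // worst case for the sequential scan
console.log('sequential ms:', time(matchSequential, sentence))
console.log('combined ms:  ', time(matchCombined, sentence))
```

Both functions return the index of the matching pattern (or -1), so the two approaches can be checked against each other before comparing their timings.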

To make things easier to maintain, I would suggest storing each trigger and its action in the same place, for example in an Array of Objects. This would also let you add more to these objects later if needed (for example, naming the intent):

    const intents = [
        { regexp: /from (?<fromCity>.+)/, action: fromCity },
        { regexp: /to (?<toCity>.+)/, action: toCity },
        { regexp: /.../, action: anotherAction },
    ];

    // We use find to stop as soon as we've got a result
    let result = intents.find(intent => {
        let match = sentence.match(intent.regexp);
        if (match) {
            // You can include a default action in case the action is not specified in the intent object
            // Decide what you send to your action function here
            (intent.action || defaultAction)(match, sentence, intent);
        }
        return match;
    });
    if (!result) {
        fallback();
    }
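
A self-contained variant of the same loop, wrapped in a function so it can be tried directly. The handler bodies, the optional `name` field, the greeting intent, and the `classify` wrapper are assumptions added for illustration.

```javascript
// Hypothetical handlers: each receives the match and returns a reply string.
const fromCity = match => `Leaving from ${match.groups.fromCity}`
const toCity = match => `Going to ${match.groups.toCity}`
const defaultAction = (match, sentence, intent) => `Matched intent "${intent.name}"`
const fallback = () => 'Sorry, I did not understand'

const intents = [
  { name: 'from', regexp: /from (?<fromCity>.+)/, action: fromCity },
  { name: 'to', regexp: /to (?<toCity>.+)/, action: toCity },
  { name: 'greeting', regexp: /^(hi|hello)\b/i } // no action: defaultAction is used
]

function classify(sentence) {
  // A plain loop stops at the first matching intent, like find() does.
  for (const intent of intents) {
    const match = sentence.match(intent.regexp)
    if (match) {
      return (intent.action || defaultAction)(match, sentence, intent)
    }
  }
  return fallback()
}

console.log(classify('from Genova'))
console.log(classify('Hello there'))
```

Returning the handler's result from `classify` keeps the lookup and the side effects separate, which makes the intent table easy to test on its own.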


 