Check whether each word exists in the database

Issue

I need to check whether each word of a string is spelled correctly by looking up each word in a MongoDB collection.

  1. The number of DB queries should be kept to a minimum.
  2. The first word of each sentence must be capitalized, but that word may be stored in upper or lower case in the dictionary. So I need a case-sensitive match for every word except the first word of each sentence, which should be matched case-insensitively.

Sample string

This is a simple example. Example. This is another example.

Dictionary structure

Assume there is a dictionary collection like this

{ word: 'this' },
{ word: 'is' },
{ word: 'a' },
{ word: 'example' },
{ word: 'Name' }

In my case, there are 100,000 words in this dictionary. Names are stored capitalized, verbs are stored in lower case, and so on.

Expected result

The words simple and another should be recognized as 'misspelled' because they do not exist in the DB.

In this case, the array of existing words should be ['This', 'is', 'a', 'example']. This is capitalized because it is the first word of a sentence; in the DB it is stored in lower case as this.

My attempt so far (Updated)

const   sentences   = string.replace(/([.?!])\s*(?= [A-Z])/g, '$1|').split('|');
let     search      = [],
        words       = [],
        existing,
        missing;

sentences.forEach(sentence => {
    const   w   = sentence.trim().replace(/[^a-zA-Z0-9äöüÄÖÜß ]/gi, '').split(' ');

    w.forEach((word, index) => {
        const regex = new RegExp(['^', word, '$'].join(''), index === 0 ? 'i' : '');
        search.push(regex);
        words.push(word);
    });
});

existing = Dictionary.find({
    word: { $in: search }
}).map(obj => obj.word);

missing = _.difference(words, existing);

Problem

  1. The case-insensitive matches don't work properly: /^Example$/i does return a result, but existing then contains the original lowercase example from the DB, so Example ends up in the missing array. The case-insensitive search itself works as expected, but the result arrays have a mismatch. I don't know how to solve this.
  2. Can the code be optimized? I'm using two forEach loops and a difference.

This is how I would approach the problem:

  • Use a regex to split the text into an array of tokens (words, plus the whitespace and '.' between them).

     var words = para.match(/(.+?)(\b)/g); // this expression is not perfect but will work

  • Now load all words from your collection into an array by using find(). Let's say the name of that array is wordsOfColl.

  • Now check whether the words are the way you want them:

     var prevWord = ""; // previous token, used to detect the first word of a sentence
     // wordsOfColl is an array, so lowercase each entry rather than the array itself
     var lowerColl = wordsOfColl.map(function (w) { return w.toLowerCase(); });
     words.forEach(function (word) {
         if (!/\w/.test(word)) {
             // whitespace or punctuation token: remember it and move on
             prevWord = word;
             return;
         }
         if (lowerColl.indexOf(word.toLowerCase()) !== -1) {
             if (prevWord.replace(/\s/g, '') === '.') {
                 // this is the first word of a sentence
                 if (word[0] !== word[0].toUpperCase()) {
                     // not capitalized, so generate an error
                 }
             }
         } else {
             // not in the collection, generate an error
         }
         prevWord = word;
     });

I haven't tested it, so please let me know in the comments if there is an issue or if I missed one of your requirements.
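To make the tokenizing step concrete, here is a small sketch of what the expression from the first bullet yields when written with a single-backslash \b. The sample text and the variable name onlyWords are just for illustration. Note that whitespace and punctuation come back as tokens of their own, and a trailing '.' at the very end of the text is not matched at all.

```javascript
// Illustration only: shows the tokens produced by the answer's regex.
const para = 'This is a simple example. Example.';

// Lazy match up to the next word boundary, repeated over the whole text.
const tokens = para.match(/(.+?)(\b)/g);
// tokens → ['This', ' ', 'is', ' ', 'a', ' ', 'simple', ' ', 'example', '. ', 'Example']

// Keep only tokens that contain at least one word character.
// Caveat: \w and \b do not treat ä, ö, ü or ß as word characters.
const onlyWords = tokens.filter(t => /\w/.test(t));
// onlyWords → ['This', 'is', 'a', 'simple', 'example', 'Example']
```

The '. ' token is what the sentence-start check relies on: stripped of whitespace, it equals '.'.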

Update

The author of the question pointed out that he doesn't want to load the whole collection on the client. In that case, you can create a method on the server that returns an array of words, instead of giving the client access to the collection.
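A rough sketch of what such a server-side helper could look like is below. It is not tested against a real database: findWords is a hypothetical callback standing in for the actual query (for example something along the lines of Dictionary.find({ word: { $in: candidates } }).map(obj => obj.word)), and checkSpelling is a made-up name. The idea is to send plain strings instead of regexes, adding a lowercase variant only for sentence-initial words, so a single query is enough and the result can be compared against the original spelling.

```javascript
// Sketch only. `findWords(candidates)` is an assumed stand-in for the DB
// lookup: it receives an array of strings and returns those that exist.
function checkSpelling(text, findWords) {
    const tokens = [];              // { word, first } in text order
    const candidates = new Set();   // exact strings for one single query

    // Simplification: a sentence ends at ., ? or ! followed by whitespace.
    text.split(/(?<=[.?!])\s+/).forEach(sentence => {
        const words = sentence
            .replace(/[^a-zA-Z0-9äöüÄÖÜß ]/g, '')
            .split(' ')
            .filter(Boolean);
        words.forEach((word, index) => {
            const first = index === 0;
            tokens.push({ word, first });
            candidates.add(word);
            // A sentence-initial word may be stored in lower case.
            if (first) candidates.add(word.toLowerCase());
        });
    });

    const found = new Set(findWords([...candidates]));
    const existing = [];
    const missing = [];
    tokens.forEach(({ word, first }) => {
        // Exact match everywhere; lowercase fallback only for first words.
        const ok = found.has(word) || (first && found.has(word.toLowerCase()));
        const target = ok ? existing : missing;
        if (!target.includes(word)) target.push(word);
    });
    return { existing, missing };
}

// Usage with an in-memory dictionary in place of the collection.
const dictionary = ['this', 'is', 'a', 'example', 'Name'];
const fakeFind = candidates => candidates.filter(w => dictionary.includes(w));
const result = checkSpelling(
    'This is a simple example. Example. This is another example.',
    fakeFind
);
// result.missing → ['simple', 'another']
```

Because the words keep their original spelling in the result, Example is reported as existing even though the dictionary only stores example. Exact string matches in $in can also use an ordinary index on word, which case-insensitive regexes generally cannot.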
