Issue
I need to check whether each word of a string is spelled correctly by looking each word up in a MongoDB collection.
Sample string
This is a simple example. Example. This is another example.
Dictionary structure
Assume there is a dictionary collection like this
{ word: 'this' },
{ word: 'is' },
{ word: 'a' },
{ word: 'example' },
{ word: 'Name' }
In my case, there are 100,000 words in this dictionary. Names are stored capitalized, verbs are stored in lower case, and so on.
Expected result
The words 'simple' and 'another' should be recognized as misspelled, as they do not exist in the DB.
The array of all existing words should in this case be ['This', 'is', 'a', 'example']. 'This' is upper case because it is the first word of a sentence; in the DB it is stored in lower case as 'this'.
My attempt so far (Updated)
const sentences = string.replace(/([.?!])\s*(?= [A-Z])/g, '$1|').split('|');
let search = [],
    words = [],
    existing,
    missing;
sentences.forEach(sentence => {
    const w = sentence.trim().replace(/[^a-zA-Z0-9äöüÄÖÜß ]/gi, '').split(' ');
    w.forEach((word, index) => {
        const regex = new RegExp(['^', word, '$'].join(''), index === 0 ? 'i' : '');
        search.push(regex);
        words.push(word);
    });
});
existing = Dictionary.find({
    word: { $in: search }
}).map(obj => obj.word);
missing = _.difference(words, existing);
Problem
/^Example$/i will give me a result. But existing will contain the original lower-case 'example', which means 'Example' will end up in the missing array. So the case-insensitive search works as expected, but the result arrays have a mismatch. I don't know how to solve this. forEach loops and a difference ...
This is how I would approach this issue:
Use a regex to split the text into an array of tokens: each word, plus the spaces and punctuation (including '.') between the words.
var words = para.match(/(.+?)(\b)/g); // this expression is not perfect but will work
Now load all words from your collection into an array by using find(). Let's say the name of that array is wordsOfColl.
Now check whether the words are in the form you want:
// lower-case copy of the collection words for case-insensitive lookups
var lowerColl = wordsOfColl.map(function (w) { return w.toLowerCase(); });
var prevWord = '.'; // previous token; starts as '.' so the very first word counts as a sentence start
words.forEach(function (word) {
    if (!/\w/.test(word)) {
        // separator (spaces, punctuation), not a word
        prevWord = word;
        return;
    }
    if (lowerColl.indexOf(word.toLowerCase()) !== -1) {
        if (prevWord.replace(/\s/g, '') === '.') {
            // this is the first word of a sentence
            if (word[0] !== word[0].toUpperCase()) {
                // not capitalized, so generate an error
            }
        }
    } else {
        // not in collection, generate an error
    }
    prevWord = word;
});
I haven't tested it, so please let me know in the comments if there's an issue or if I missed one of your requirements.
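The mismatch described in the question could also be avoided after the query, without _.difference. The following is only a minimal sketch under a few assumptions: the helper names splitWords and partitionWords are made up for illustration, sentences are split naively on '.', '?' and '!', and the found array stands in for whatever the $in query returns. Each input word is remembered together with its position in the sentence, and is then compared with the returned dictionary words using the same rule it was searched with.

```javascript
// Split text into sentences, then into words, remembering which word starts a sentence.
// Assumption: '.', '?' and '!' always end a sentence (abbreviations are not handled).
function splitWords(text) {
  const sentences = text.split(/[.?!]+/).map(s => s.trim()).filter(s => s.length > 0);
  const entries = [];
  sentences.forEach(sentence => {
    const words = sentence
      .replace(/[^a-zA-Z0-9äöüÄÖÜß ]/g, '')
      .split(/\s+/)
      .filter(w => w.length > 0);
    words.forEach((word, index) => {
      entries.push({ word: word, sentenceStart: index === 0 });
    });
  });
  return entries;
}

// Decide per word whether one of the returned dictionary words matches it.
// A sentence-initial word is compared case-insensitively, any other word exactly.
function partitionWords(entries, found) {
  const exact = new Set(found);
  const lower = new Set(found.map(w => w.toLowerCase()));
  const existing = [];
  const missing = [];
  entries.forEach(entry => {
    const ok = entry.sentenceStart
      ? lower.has(entry.word.toLowerCase())
      : exact.has(entry.word);
    (ok ? existing : missing).push(entry.word);
  });
  return { existing: existing, missing: missing };
}

// `found` stands in for Dictionary.find({ word: { $in: search } }).map(obj => obj.word).
const found = ['this', 'is', 'a', 'example'];
const sample = 'This is a simple example. Example. This is another example.';
const result = partitionWords(splitWords(sample), found);
console.log(result.missing);
```

Because every word keeps its original spelling, 'Example' ends up in existing even though the DB returned 'example'. Repeated words are kept as duplicates in both arrays; deduplicate them if you need unique lists.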
Update
Since the author of the question does not want to load the whole collection on the client, you can create a method on the server that returns an array of words instead of giving the client access to the collection.
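As a rough sketch of that idea: the function below is what such a server method could do internally. The name lookupWords is hypothetical, and the in-memory dictionary array only stands in for the real collection so the example runs on its own. How the function is exposed to the client depends on your framework and is not shown here.

```javascript
// Stand-in for the dictionary collection; on a real server this data would stay in MongoDB.
const dictionary = [
  { word: 'this' },
  { word: 'is' },
  { word: 'a' },
  { word: 'example' },
  { word: 'Name' }
];

// Hypothetical server-side function: receives the words to check and returns
// only the dictionary words that match one of them, ignoring case.
function lookupWords(candidates) {
  const wanted = new Set(candidates.map(w => w.toLowerCase()));
  return dictionary
    .map(doc => doc.word)
    .filter(word => wanted.has(word.toLowerCase()));
}

const returned = lookupWords(['This', 'simple', 'Example', 'name']);
console.log(returned);
```

The client then receives only the matching entries in their stored spelling and can apply the upper/lower case rules itself, so the full 100,000-word list never leaves the server.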