
Leveraging Spell Checker on local machine?

I notice that common applications on a given machine (Mac, Linux, or Windows) have their own spell checkers: everything from various IDEs, to MS Word/Office, to note-taking software.

I am trying to use the built-in utility on our respective machines to check whether the words in a string are spelled correctly. It seems that I can't just use what is on the machine and would likely have to download a dictionary to compare against.

I was not sure if there was a better way to accomplish this. I was hoping to do things locally, but I am not opposed to making API or curl requests to determine whether the words in a string are spelled correctly.

I was looking at:

  • LanguageTool ("hello wrold" failed to return an error)
  • Google's tbproxy, which no longer seems to be functional
  • Dictionary / Merriam-Webster, which require API keys for automation

I was looking at Node packages and noticed spell checker modules which encapsulate wordlists as well.

Is there a way to use the built-in machine dictionaries at all, or would it be better to download a dictionary / wordlist to compare against?

I am thinking a wordlist might be the best bet, but I didn't want to reinvent the wheel. What have others done to accomplish something similar?

Credit goes to Lukas Knuth. I want to give an explicit how-to for using the dictionary package and nspell.

Install the following two dependencies:

npm install nspell dictionary-en-us

Here is a sample file I wrote to solve the problem.

// Node File

//  node spellcheck.js [path]
//  path: [optional] either absolute or local path from pwd/cwd

//  if you run the file from within Seg.Ui.Frontend/ it works as well.
//    node utility/spellcheck.js
//  OR from the utility directory using a path:
//    node spellcheck.js ../src/assets/i18n/en.json

var fs = require("fs");
var dictionary = require("dictionary-en-us");
var nspell = require("nspell");
var process = require("process");
// Default path, used when no path argument is given.
var path = "src/assets/i18n/en.json";

let strings = [];
function getStrings(json){
    let keys = Object.keys(json);
    for (let idx of keys){
        let val = json[idx];
        if (isObject(val)) getStrings(val);
        if (isString(val)) strings.push(val)
    }
}

function sanitizeStrings(strArr){
    let set = new Set();
    for (let sentence of strArr){
        sentence.split(" ").forEach(word => {
            word = word.trim().toLowerCase();
            if (word.endsWith(".") || word.endsWith(":") || word.endsWith(",")) word = word.slice(0, -1);
            if (ignoreThisString(word)) return;
            if (word == "") return;
            if (isNumber(word)) return;
            set.add(word)
        });
    }
    return [ ...set ];
}

function ignoreThisString(word){
    // we need to ignore special cased strings, such as items with
    //  Brackets, Mustaches, Question Marks, Single Quotes, Double Quotes
    // A plain regex literal is enough here; the "g" and "i" flags are not needed for a single test().
    let regex = /[{}\[\]'"?]/;
    return regex.test(word);
}

function spellcheck(err, dict){
    if (err) throw err;
    var spell = nspell(dict);
    let misspelled_words = strings.filter( word => {
        return !spell.correct(word)
    });
    misspelled_words.forEach( word => console.log(`Plausible Misspelled Word: ${word}`))
    return misspelled_words;
}

function isObject(obj) { return obj instanceof Object }
function isString(obj) { return typeof obj === "string" }
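// Compare against NaN rather than relying on truthiness, so that "0" also counts as a number.
function isNumber(obj) { return !Number.isNaN(parseInt(obj, 10)); }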

function main(args){
    //node file.js path
    if (args.length >= 3) path = args[2]
    if (!fs.existsSync(path)) {
        console.log(`The path does not exist: ${process.cwd()}/${path}`);
        return;
    }
    var content = fs.readFileSync(path)
    var json = JSON.parse(content);
    getStrings(json);
    // console.log(`String Array (length: ${strings.length}): ${strings}`)
    strings = sanitizeStrings(strings);
    console.log(`String Array (length: ${strings.length}): ${strings}\n\n`)

    dictionary(spellcheck);
}
main(process.argv);

This returns a subset of the strings to review; each one is either genuinely misspelled or a false positive.

False positives typically fall into one of these groups:

  • Acronyms
  • Non-US English spellings
  • Unrecognized proper nouns, for example days of the week and months (note that the script lowercases every word before checking it)
  • Strings that contain parentheses. These can be avoided by trimming the parentheses off the word.

Obviously, this does not cover every case, but I added an ignoreThisString function that you can extend if, say, a string contains a special word or phrase the developers want ignored.

This is meant to be run as a node script.
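The false-positive cases listed above can be reduced by normalizing each token before it reaches the spell checker. The following is only a minimal sketch under some assumptions: the helper names (stripPunctuation, isAcronym, candidateWords) and the allow-list parameter are hypothetical and not part of the script above, and treating any 2 to 6 letter all-caps token as an acronym is a heuristic, not a rule.

```javascript
// Hypothetical helpers for illustration; they are not part of the script above.

// Strip punctuation (including parentheses) from both ends of a token.
function stripPunctuation(token) {
    return token.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, "");
}

// Heuristic: treat short all-caps tokens such as "API" or "JSON" as acronyms.
function isAcronym(token) {
    return /^[A-Z]{2,6}$/.test(token);
}

// Return the unique words worth passing to a spell checker.
// Case is preserved so a dictionary can still match capitalized proper nouns.
function candidateWords(sentence, allowList = new Set()) {
    const words = new Set();
    for (const raw of sentence.split(/\s+/)) {
        const word = stripPunctuation(raw);
        if (word === "") continue;                        // nothing left after trimming
        if (isAcronym(word)) continue;                    // skip likely acronyms
        if (/^\d+$/.test(word)) continue;                 // skip plain numbers
        if (allowList.has(word.toLowerCase())) continue;  // project-specific words
        words.add(word);
    }
    return [...words];
}

const sample = "Upload (optional) JSON files on Monday, max 10 files.";
console.log(candidateWords(sample, new Set(["max"])));
```

In this sketch "(optional)" loses its parentheses, "JSON" and "10" are skipped, "max" is dropped through the allow list, and "Monday" keeps its capital letter so a case-sensitive dictionary has a chance to recognize it.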

Your question is tagged as both NodeJS and Python. This answer covers the NodeJS side, but I imagine the approach is very similar in Python.


Windows (from Windows 8 onward) and Mac OS X do have built-in spellchecking engines.

  • Windows: The "Windows Spell Checking API" is a C/C++ API. To use it with NodeJS, you'll need to create a binding.
  • Mac OS X: "NSSpellChecker" is part of AppKit, used for GUI applications. This is an Objective-C API, so again you'll need to create a binding.
  • Linux: There is no "OS specific" API here. Most applications use Hunspell but there are alternatives. This again is a C/C++ library, so bindings are needed.

Fortunately, there is already a module called spellchecker which has bindings for all of the above. This will use the built-in system for the platform it's installed on, but there are multiple drawbacks:

1) Native extensions must be built. This module ships prebuilt binaries via node-pre-gyp, but these are installed for a specific platform. If you develop on Mac OS X, run npm install to get the package, and then deploy your application on Linux (along with the node_modules directory), it won't work.

2) Using built-in spellchecking means using the defaults dictated by the OS, which might not be what you want. For example, the language used might be determined by the selected OS language. For a UI application (for example one built with Electron) this might be fine, but if you want to do server-side spellchecking in languages other than the OS language, it might prove difficult.


At the basic level, spellchecking some text boils down to:

  1. Tokenizing the string (e.g. by spaces)
  2. Checking every token against a list of known correct words
  3. (Bonus) Gather suggestions for wrong tokens and give the user options.
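The three steps above can be sketched in a few lines of plain JavaScript. This is only an illustration under stated assumptions: KNOWN_WORDS is a tiny made-up word list standing in for a real dictionary, the function names are hypothetical, and suggestions are ranked by Levenshtein edit distance, which is one common choice rather than what any particular library does.

```javascript
// Tiny illustrative word list; a real checker would load a full dictionary.
const KNOWN_WORDS = new Set(["hello", "world", "word", "help"]);

// Step 1: tokenize on anything that is not a letter or an apostrophe.
function tokenize(text) {
    return text.toLowerCase().split(/[^a-z']+/).filter(Boolean);
}

// Levenshtein edit distance between two strings (row-by-row dynamic programming).
function editDistance(a, b) {
    let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        const curr = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
        }
        prev = curr;
    }
    return prev[b.length];
}

// Step 3: suggest known words within maxDistance edits, closest first.
function suggest(word, maxDistance = 2) {
    return [...KNOWN_WORDS]
        .map(candidate => ({ candidate, distance: editDistance(word, candidate) }))
        .filter(entry => entry.distance <= maxDistance)
        .sort((x, y) => x.distance - y.distance)
        .map(entry => entry.candidate);
}

// Steps 1 to 3 combined: step 2 is the Set lookup inside the filter.
function checkText(text) {
    return tokenize(text)
        .filter(token => !KNOWN_WORDS.has(token))
        .map(token => ({ word: token, suggestions: suggest(token) }));
}

console.log(checkText("hello wrold"));
```

With this word list, "hello" passes the lookup and "wrold" is reported with "world" among its suggestions. Comparing against every known word is fine for a sketch but slow for a full dictionary, which is one reason to reach for an existing library.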

You can write part 1 yourself. Parts 2 and 3 require a "list of known correct words", that is, a dictionary. Fortunately, there is already a format (Hunspell's .dic files) and tools to work with it:

  • simple-spellchecker can work with .dic files.
  • nspell is a JS implementation of Hunspell, complete with its own dictionary packages.
  • Additional Dictionaries can be found for example in this repo
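To give a feel for the .dic format mentioned above: it is a plain-text file whose first line is an entry count, followed by one word per line, optionally with "/" and affix flags that refer to rules in a companion .aff file. The sketch below is a deliberate simplification, not how nspell or simple-spellchecker work internally: the hypothetical parseDic function discards the flags, so forms derived through affix rules are not recognized, and it does not handle escaped slashes or morphological fields.

```javascript
// Parse the text of a Hunspell .dic file into a Set of base words.
// Simplification: affix flags after "/" are discarded, so only base forms match.
function parseDic(dicText) {
    const lines = dicText.split(/\r?\n/);
    const words = new Set();
    // The first line holds the entry count, so start at the second line.
    for (const line of lines.slice(1)) {
        const entry = line.trim();
        if (entry === "") continue;
        words.add(entry.split("/")[0]);
    }
    return words;
}

// Hypothetical miniature dictionary, for illustration only.
const sampleDic = ["3", "hello", "world/S", "Monday/M"].join("\n");
const words = parseDic(sampleDic);

console.log(words.has("world"));  // true: base form is listed
console.log(words.has("worlds")); // false: the affix flag was ignored
```

The second lookup shows why a Hunspell-aware library is still worth using: applying the affix rules is what turns a compact word list into full coverage of inflected forms.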

With this, you get to choose the language, you don't need to build/download any native code and your application will work the same on every platform. If you're spellchecking on the server, this might be your most flexible option.


 