
Leveraging Spell Checker on local machine?

I notice that common applications on a given machine (Mac, Linux, or Windows) have their own spell checkers: everything from various IDEs, to MS Word/Office, to note-taking software.

I am trying to use the built-in utility of our respective machines to analyze strings for syntactic correctness. It seems that I can't just use what is on the machine and would likely have to download a dictionary to compare against.

I was not sure if there was a better way to accomplish this. I was looking at doing things locally, but I am not opposed to making API or curl requests to determine whether the words in a string are spelled correctly.

I was looking at:

  • LanguageTool ("hello wrold" failed to return an error)
  • Google's tbproxy, which no longer seems to be functional
  • Dictionary / Merriam-Webster, which require API keys for automation

I was also looking at Node packages and noticed spell checker modules that bundle their own wordlists.

Is there a way to use the built-in machine dictionaries at all, or would it be better to download a dictionary / wordlist to compare against?

I am thinking a wordlist might be the best bet, but I didn't want to reinvent the wheel. What have others done to accomplish something similar?

Credit goes to Lukas Knuth. I want to give an explicit how-to for using a dictionary package and nspell.

Install the following two dependencies:

npm install nspell dictionary-en-us

Here is a sample file I wrote to solve the problem.

// Node File

//  node spellcheck.js [path]
//  path: [optional] either absolute or local path from pwd/cwd

//  if you run the file from within Seg.Ui.Frontend/ it works as well.
//    node utility/spellcheck.js
//  OR from the utility directory using a path:
//    node spellcheck.js ../src/assets/i18n/en.json

var fs = require("fs");
var dictionary = require("dictionary-en-us");
var nspell = require("nspell");
var process = require("process");
// path to use if not defined.
var path = "src/assets/i18n/en.json"

let strings = [];
function getStrings(json){
    let keys = Object.keys(json);
    for (let idx of keys){
        let val = json[idx];
        if (isObject(val)) getStrings(val);
        if (isString(val)) strings.push(val)
    }
}

function sanitizeStrings(strArr){
    let set = new Set();
    for (let sentence of strArr){
        sentence.split(" ").forEach(word => {
            word = word.trim().toLowerCase();
            if (word.endsWith(".") || word.endsWith(":") || word.endsWith(",")) word = word.slice(0, -1);
            if (ignoreThisString(word)) return;
            if (word == "") return;
            if (isNumber(word)) return;
            set.add(word)
        });
    }
    return [ ...set ];
}

function ignoreThisString(word){
    // we need to ignore special cased strings, such as items with
    //  Brackets, Mustaches, Question Marks, Single Quotes, Double Quotes
    // A regex literal is enough here; the "g" and "i" flags are not needed for a single test.
    let regex = /[{}\[\]'"?]/;
    return regex.test(word);
}

function spellcheck(err, dict){
    if (err) throw err;
    var spell = nspell(dict);
    let misspelled_words = strings.filter( word => {
        return !spell.correct(word)
    });
    misspelled_words.forEach( word => console.log(`Plausible Misspelled Word: ${word}`))
    return misspelled_words;
}

function isObject(obj) { return obj instanceof Object }
function isString(obj) { return typeof obj === "string" }
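// Treat anything parseInt can read as a number, including "0"
// (the !!parseInt(...) form would have reported "0" as not a number).
function isNumber(obj) { return !Number.isNaN(parseInt(obj, 10)) }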

function main(args){
    //node file.js path
    if (args.length >= 3) path = args[2]
    if (!fs.existsSync(path)) {
        console.log(`The path does not exist: ${process.cwd()}/${path}`);
        return;
    }
    // Read as text so JSON.parse receives a string rather than a Buffer.
    var content = fs.readFileSync(path, "utf8");
    var json = JSON.parse(content);
    getStrings(json);
    // console.log(`String Array (length: ${strings.length}): ${strings}`)
    strings = sanitizeStrings(strings);
    console.log(`String Array (length: ${strings.length}): ${strings}\n\n`)

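    // Note: this callback style matches the dictionary-en-us release used when this
    // was written; newer releases may expose a different interface, so check the
    // README of the version you install.
    dictionary(spellcheck);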
}
main(process.argv);

This will return a subset of strings to review; they may be genuine misspellings or false positives.

False positives typically fall into these categories:

  • Acronyms
  • Non-US English variants of words
  • Unrecognized proper nouns (days of the week and months, for example)
  • Strings that contain parentheses. This can be mitigated by trimming them off the word.
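
One way to reduce the punctuation and acronym false positives listed above is to clean each token before it reaches the checker. The following is a minimal sketch, not part of nspell: `cleanToken`, `looksLikeAcronym`, and `wordsToCheck` are hypothetical helper names, and the acronym rule (2 to 6 upper-case letters) is an assumption you may want to tune for your own strings.

```javascript
// Hypothetical helpers (not part of nspell) for pre-filtering tokens.

// Strip leading/trailing characters that are not letters or digits,
// e.g. "(example)," -> "example". Inner apostrophes and hyphens are kept.
function cleanToken(token) {
    return token.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, "");
}

// Assumption: an acronym is 2 to 6 upper-case ASCII letters, e.g. "API".
function looksLikeAcronym(token) {
    return /^[A-Z]{2,6}$/.test(token);
}

// Split a sentence on whitespace and return the unique, lower-cased words
// that are worth spell checking.
function wordsToCheck(sentence) {
    const words = new Set();
    for (const raw of sentence.split(/\s+/)) {
        const token = cleanToken(raw);
        if (token === "" || looksLikeAcronym(token)) continue;
        if (/^\d+$/.test(token)) continue; // skip plain numbers
        words.add(token.toLowerCase());
    }
    return [...words];
}
```

The acronym check runs before lower-casing on purpose, since the casing is the only signal it uses.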

Obviously, this doesn't cover all cases, but I added an ignoreThisString function you can extend if, say, a string contains a special word or phrase the developers want ignored.
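
The ignore hook could also take a project-specific allowlist. Here is a minimal sketch, assuming the checker is passed in as a plain function (with nspell this could be `word => spell.correct(word)`); `findMisspelled` and `ignoredWords` are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: filter words through an allowlist before spell checking.
// `isCorrect` is any function returning true for correctly spelled words;
// with nspell it could be `word => spell.correct(word)`.
function findMisspelled(words, isCorrect, ignoredWords = []) {
    const ignored = new Set(ignoredWords.map(w => w.toLowerCase()));
    return words.filter(word => !ignored.has(word.toLowerCase()) && !isCorrect(word));
}

// Usage with a stand-in checker (a real one would wrap a dictionary).
const known = new Set(["hello", "world"]);
const result = findMisspelled(
    ["hello", "wrold", "nspell"],
    word => known.has(word),
    ["nspell"] // project-specific term the developers want ignored
);
console.log(result); // prints [ 'wrold' ]
```

Passing the checker in as a function keeps the filter independent of any particular spell-checking library.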

This is meant to be run as a Node script.

Your question is tagged as both NodeJS and Python. This answer covers the NodeJS side, but I imagine the Python approach is very similar.


Windows (from Windows 8 onward) and Mac OS X do have built-in spellchecking engines.

  • Windows: The "Windows Spell Checking API" is a C/C++ API. To use it with NodeJS, you'll need to create a binding.
  • Mac OS X: "NSSpellChecker" is part of AppKit, used for GUI applications. This is an Objective-C API, so again you'll need to create a binding.
  • Linux: There is no "OS specific" API here. Most applications use Hunspell, but there are alternatives. This again is a C/C++ library, so bindings are needed.

Fortunately, there is already a module called spellchecker that has bindings for all of the above. It uses the built-in system of the platform it is installed on, but there are drawbacks:

1) Native extensions must be built. This one ships prebuilt binaries via node-pre-gyp, but these need to be installed for specific platforms. If you develop on Mac OS X, run npm install to get the package, and then deploy your application to Linux (with the node_modules directory), it won't work.

2) Using built-in spellchecking will use defaults dictated by the OS, which might not be what you want. For example, the language used might be dictated by the selected OS language. For a UI application (for example, one built with Electron) this might be fine, but if you want to do server-side spellchecking in languages other than the OS language, it might prove difficult.


At the basic level, spellchecking some text boils down to:

  1. Tokenizing the string (e.g. by spaces)
  2. Checking every token against a list of known correct words
  3. (Bonus) Gathering suggestions for wrong tokens and giving the user options.
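
The three steps above can be sketched in plain JavaScript. This is only an illustration under stated assumptions: the word list is a tiny hard-coded `Set`, and the suggestion step is a naive search over words one edit away (deletion, adjacent transposition, replacement, insertion), which is not how Hunspell-style engines work internally. All function names here are hypothetical.

```javascript
const ALPHABET = "abcdefghijklmnopqrstuvwxyz";

// Step 1: tokenize on anything that is not a letter or apostrophe.
function tokenize(text) {
    return text.toLowerCase().split(/[^a-z']+/).filter(Boolean);
}

// All strings one edit away from `word` (naive; fine for short words).
function editsOf(word) {
    const edits = new Set();
    for (let i = 0; i <= word.length; i++) {
        const head = word.slice(0, i);
        const tail = word.slice(i);
        if (tail) edits.add(head + tail.slice(1)); // deletion
        if (tail.length > 1) {
            edits.add(head + tail[1] + tail[0] + tail.slice(2)); // transposition
        }
        for (const c of ALPHABET) {
            if (tail) edits.add(head + c + tail.slice(1)); // replacement
            edits.add(head + c + tail); // insertion
        }
    }
    return edits;
}

// Steps 2 and 3: check each token and collect suggestions for unknown ones.
function check(text, wordList) {
    const report = [];
    for (const token of new Set(tokenize(text))) {
        if (wordList.has(token)) continue;
        const suggestions = [...editsOf(token)].filter(w => wordList.has(w));
        report.push({ word: token, suggestions });
    }
    return report;
}

const words = new Set(["hello", "world", "word"]);
console.log(check("Hello wrold", words));
```

With this tiny word list, `wrold` is reported with `world` as its only suggestion, because `word` is two edits away. A real dictionary replaces the `Set` and the suggestion search, but the overall flow stays the same.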

You can write part 1 yourself. Parts 2 and 3 require a "list of known correct words", i.e. a dictionary. Fortunately, a format and tools to work with it already exist (for example, Hunspell-format dictionaries and the nspell package used in the other answer).

With this, you get to choose the language, you don't need to build or download any native code, and your application will work the same on every platform. If you're spellchecking on the server, this might be your most flexible option.


 