How can I find the singular in the plural when some letters change? What is the best approach?

Question

How can I find the singular form inside the plural when some letters change?

Consider the following situation:

The German word Schließfach means lockbox.
The plural is Schließfächer.

As you can see, the letter a has changed to ä. For this reason, the first word is no longer a substring of the second one; as far as a regex is concerned, they are different strings.
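To make the problem concrete, here is a minimal sketch. The variable names are only for illustration, and both strings are normalized to NFC so that ä is a single code point.

```javascript
const singular = 'Schließfach'.normalize('NFC');
const plural = 'Schließfächer'.normalize('NFC');

// The singular is not contained in the plural, because a became ä.
const isSubstring = plural.includes(singular);
const regexMatches = new RegExp(singular).test(plural);
console.log(isSubstring, regexMatches); // false false

// Only the part before the changed vowel is still shared.
let shared = 0;
while (shared < singular.length && singular[shared] === plural[shared]) {
  shared++;
}
const sharedPrefix = singular.slice(0, shared);
console.log(sharedPrefix); // Schließf
```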

Maybe I picked the wrong tags below, and maybe regex is not the right tool for this. I've seen that naturaljs (natural.NounInflector()) provides this functionality out of the box for English words. Maybe there are similar solutions for German?

What is the best approach? How can I find the singular in the plural in German?

Answer 1

I once had to build a text processor that parsed many languages, in registers ranging from very casual to very formal. One of the things it had to identify was whether certain words were related (for example, a noun in a title that was related to a list of things, sometimes labeled with a plural form).

IIRC, 70-90% of singular and plural word forms across all the languages we supported had a "Levenshtein distance" of less than 3 or 4. (Eventually several dictionaries were added to improve accuracy, because distance alone produced many false positives.) Another interesting finding was that the longer the words, the more likely a distance of 3 or fewer indicated a relationship in meaning.

Here's an example of the libraries we used:

const fastLevenshtein = require('fast-levenshtein');
const deburr = require('lodash.deburr');

// Raw edit distances between candidate word pairs.
console.log('Raw Distances:');
console.log('Score 1:', fastLevenshtein.get('Schließfächer', 'Schließfach'));
// -> 3
console.log('Score 2:', fastLevenshtein.get('Blumtach', 'Blumtächer'));
// -> 3
console.log('Score 3:', fastLevenshtein.get('schließfächer', 'Schliessfaech'));
// -> 7
console.log('Score 4:', fastLevenshtein.get('not-it', 'Schliessfaech'));
// -> 12
console.log('Score 5:', fastLevenshtein.get('not-it', 'Schiesse'));
// -> 8


/**
 * Additional strategy for dealing with other various languages:
 *   "Deburr" the strings to omit diacritics before checking the distance.
 *   Note that deburr must be applied to each string, not to the numeric result.
 */

console.log('Deburred Distances:');
console.log('Score 1:', fastLevenshtein.get(deburr('Schließfächer'), deburr('Schließfach')));
// -> 2
console.log('Score 2:', fastLevenshtein.get(deburr('Blumtach'), deburr('Blumtächer')));
// -> 2
console.log('Score 3:', fastLevenshtein.get(deburr('schließfächer'), deburr('Schliessfaech')));
// -> 4


// deburr folds ä to a and ß to ss, so those changes no longer count as edits.
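To show how the distance threshold and a dictionary could work together, here is a rough, dependency-free sketch. The helper closestEntry, the word list, and the default threshold of 3 are assumptions made for illustration; they are not the code we actually ran, and the hand-written levenshtein function simply stands in for fast-levenshtein.

```javascript
// Plain dynamic-programming Levenshtein distance (no dependencies).
function levenshtein(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  // prev[j] is the distance between the processed prefix of s and t[0..j).
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[t.length];
}

// Hypothetical helper: return the dictionary entry closest to `word`,
// or null when nothing is within `maxDistance` edits.
function closestEntry(word, dictionary, maxDistance = 3) {
  let best = null;
  let bestDistance = Infinity;
  for (const entry of dictionary) {
    const d = levenshtein(word.toLowerCase(), entry.toLowerCase());
    if (d <= maxDistance && d < bestDistance) {
      best = entry;
      bestDistance = d;
    }
  }
  return best;
}

// Illustrative list of known singular forms.
const singulars = ['Schließfach', 'Schließung', 'Fach'];
console.log(closestEntry('Schließfächer', singulars)); // Schließfach
console.log(closestEntry('not-it', singulars)); // null
```

Restricting candidates to known dictionary words is what cuts down the false positives mentioned above; the threshold alone would accept any string that happens to be a few edits away.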

Answer 2

You can use a stemmer (which is in fact a lemmatizer) from the nlp.js library, which has models for 40 languages.

const { StemmerDe } = require('@nlpjs/lang-de');

const stemmer = new StemmerDe();
console.log(stemmer.stemWord('Schließfach'));
console.log(stemmer.stemWord('Schließfächer'));
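If the stemmer maps singular and plural to the same stem, the stem can serve as a lookup key. The sketch below is only an illustration of that idea: groupByStem is a hypothetical helper, and toyStem is a crude stand-in so the example runs without nlp.js. Whether StemmerDe actually returns identical stems for a given pair is something to check on your own data.

```javascript
// Hypothetical helper: group words by whatever `stem` returns for them.
// `stem` is passed in so that any stemmer can be plugged in.
function groupByStem(words, stem) {
  const groups = new Map();
  for (const word of words) {
    const key = stem(word);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(word);
  }
  return groups;
}

// Toy stand-in for a real stemmer, for illustration only: lowercases,
// folds ä/ö/ü/ß, and drops a trailing "er". Far too crude for real text.
function toyStem(word) {
  return word
    .normalize('NFC')
    .toLowerCase()
    .replace(/ä/g, 'a')
    .replace(/ö/g, 'o')
    .replace(/ü/g, 'u')
    .replace(/ß/g, 'ss')
    .replace(/er$/, '');
}

const groups = groupByStem(['Schließfach', 'Schließfächer', 'Blume'], toyStem);
console.log(groups.get('schliessfach')); // [ 'Schließfach', 'Schließfächer' ]

// With nlp.js, the same helper could be called as:
//   groupByStem(words, (w) => stemmer.stemWord(w));
```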

Answer 1: score 8, accepted, 2021-06-09 04:24:12
Answer 2: score 3, 2021-06-15 08:08:14
