
Word language detection in C++

After searching on Google, I haven't found any standard way or library for detecting which language a particular word belongs to.

Suppose I have a word; how could I find out which language it is: English, Japanese, Italian, German, etc.?

Is there any library available for C++? Any suggestion in this regard will be greatly appreciated!

I have found Google's CLD very helpful. It's written in C++, and from their website:

"CLD (Compact Language Detector) is the library embedded in Google's Chromium browser. The library detects the language from provided UTF8 text (plain text or HTML). It's implemented in C++, with very basic Python bindings."

Simple language recognition from words is easy. You don't need to understand the semantics of the text, and you don't need any computationally expensive algorithms, just a fast hash map. The problem is that you need a lot of data. Fortunately, you can probably find dictionaries of words for each language you care about. Define a bit mask for each language; that will allow you to mark words like "the" as recognized in multiple languages. Then read each language's dictionary into your hash map. If a word is already present from a different language, just mark the current language as well.

Suppose a given word is in both English and French. Then looking it up (e.g. "commercial") will map to ENGLISH|FRENCH; with ENGLISH = 1 and FRENCH = 2, you'll find the value 3. If you want to know whether the word is in your language only, you would test:

int langs = dict["the"];
if ((langs | mylang) == mylang)   // parentheses needed: == binds tighter than |
   // no other language



Since there will be other languages, a more general approach is probably better. For each bit set in the mask, add 1 to the corresponding language's count. Do this for n words. After about n = 10 words in a typical text, you'll have 10 for the dominant language and maybe 2 for a related language (like English/French), and you can determine with high probability that the text is English. Remember, even a text in one language can contain a quote in another, so the mere presence of a foreign word doesn't mean the document is in that language. Pick a threshold and it will work quite well (and very, very fast).

Obviously the hardest thing about this is reading in all the dictionaries. This isn't a code problem, it's a data collection problem. Fortunately, that's your problem, not mine.

To make this fast, you will need to preload the hash map, otherwise loading it at startup is going to hurt. If that's an issue, you will have to write store and load methods for the hash map that serialize and bulk-load the entire structure efficiently.

Well,

Statistically trained language detectors work surprisingly well on single-word inputs, though there are obviously some cases where they can't possibly work, as observed by others here.

In Java, I'd point you to Apache Tika. It has an open-source statistical language detector.

For C++, you could use JNI to call it. Now, time for a disclaimer: since you specifically asked for C++, and since I'm unaware of a free C++ alternative, I will point you at a product of my employer, which is a statistical language detector written natively in C++.

http://www.basistech.com , the product name is RLI.

This will not work well one word at a time, as many words are shared between languages. For instance, in several languages "the" means "tea".

Language-processing libraries tend to be more comprehensive than just this one feature, and since C++ is a "high-performance" language it might be hard to find one for free.

However, the problem might not be too hard to solve yourself. See the Wikipedia article on language identification for ideas. A small support vector machine might also do the trick quite handily. Just train it on the most common words in the relevant languages, and you might have a very effective "database" in just a kilobyte or so.

I wouldn't hold my breath. It is difficult enough to determine the language of a text automatically. If all you have is a single word, without context, you would need a database of all the words of all the languages in the world... the size of which would be prohibitive.

Basically you need a huge database of all the major languages. To auto-detect the language of a piece of text, pick the language whose dictionary contains the most words from the text. This is not something you would want to have to implement on your laptop.

Spell-check the first 3 words of your text in all languages (the more words you spell-check, the better). The language with the fewest spelling errors "wins". With only 3 words it is technically possible to get the same spelling in a few languages, but with each additional word that becomes less probable. It is not a perfect method, but I figure it would work in most cases.

Otherwise, if there is an equal number of errors in all languages, use the default language. Or randomly pick another 3 words until you have a clearer result. Or expand the number of spell-checked words to more than 3, until you get a clearer result.

As for spell-checking libraries, there are many; I personally prefer Hunspell. Nuspell is probably also good. Which one to use is a matter of personal opinion and/or technical capabilities.

I assume that you are working with text, not with speech.

If you are working with Unicode, each script has its own range of code points (a "block").

So you can check whether all the characters of a particular word fall within one script's block.

You can find more information about Unicode blocks in the Unicode documentation.
