简体   繁体   English

如何检测文本单词的主导语言?

[英]How to detect the dominant language of a text word?

It's looks good for string but it's not working for me for a word . 对于string来说看起来不错,但是对我来说word不起作用。 I am working with search as per as my requirement when user typing any 3 character in the meantime looking to check which language user typing. 当用户同时键入3个字符以检查用户键入哪种语言时,我正在按照我的要求进行搜索。 if I think it should not work with detec0t word but i expect it should be working with Islam word. 如果我认为它不适用于detec0t单词,但我希望它应与Islam单词一起使用。

let tagger = NSLinguisticTagger(tagSchemes:[.tokenType, .language, .lexicalClass, .nameType, .lemma], options: 0)

func determineLanguage(for text: String) {
    tagger.string = text
    let language = tagger.dominantLanguage
    print("The language is \(language!)")
}


//Test case
determineLanguage(for: "I love Islam") // en -pass
determineLanguage(for: "আমি ইসলাম ভালোবাসি") // bn -pass
determineLanguage(for: "أنا أحب الإسلام") // ar -pass
determineLanguage(for: "Islam") // und - failed

Result: 结果:

The language is en 语言是英语
The language is bn 语言是bn
The language is ar 语言是ar
The language is und 语言是

What I missed for "Unknown language" 我错过的“未知语言”

Simply because it belongs to too many languages and it would be unrealistic to guess the language based on one word. 仅仅因为它属于太多的语言,并且基于一个单词来猜测该语言是不现实的。 The context always helps. 上下文总是有帮助的。

For example : 例如 :

import NaturalLanguage

let recognizer = NLLanguageRecognizer()
recognizer.processString("Islam")
print(recognizer.dominantLanguage!.rawValue)  //Force unwrapping for brevity

prints tr , which stands for Turkish. 打印tr ,代表土耳其语。 It's an educated guess. 这是有根据的猜测。

If you want the other languages that were also possible, you could use languageHypotheses(withMaximum:) : 如果您还希望使用其他语言,则可以使用languageHypotheses(withMaximum:)

let hypotheses = recognizer.languageHypotheses(withMaximum: 10)

for (lang, confidence) in hypotheses.sorted(by: { $0.value > $1.value }) {
    print(lang.rawValue, confidence)
}

Which prints 哪些印刷品

 tr 0.2332388460636139 //Turkish hr 0.1371040642261505 //Croatian en 0.12280254065990448 //English pt 0.08051242679357529 de 0.06824589520692825 nl 0.05405258387327194 nb 0.050924140959978104 it 0.037797268480062485 pl 0.03097432479262352 hu 0.0288708433508873 

Now you could define an acceptable threshold of confidence in order to accept that result. 现在,您可以定义一个可接受的置信度阈值以接受该结果。


Language codes can be found here 语言代码可以在这里找到

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM