How to detect the dominant language of a single word?

It works well for a sentence, but it is not working for me for a single word. I am working on a search feature, and the requirement is that as soon as the user has typed 3 characters I need to check which language they are typing in. I can accept that it will not detect every word, but I expected it to work for the word "Islam".

import Foundation

let tagger = NSLinguisticTagger(tagSchemes:[.tokenType, .language, .lexicalClass, .nameType, .lemma], options: 0)

func determineLanguage(for text: String) {
    tagger.string = text
    // dominantLanguage is optional; fall back to "und" (undetermined) instead of force unwrapping
    let language = tagger.dominantLanguage ?? "und"
    print("The language is \(language)")
}


//Test case
determineLanguage(for: "I love Islam") // en -pass
determineLanguage(for: "আমি ইসলাম ভালোবাসি") // bn -pass
determineLanguage(for: "أنا أحب الإسلام") // ar -pass
determineLanguage(for: "Islam") // und - failed

Result:

The language is en
The language is bn
The language is ar
The language is und

What am I missing that causes the "und" (unknown language) result for a single word?

Simply because the word belongs to too many languages, and it would be unrealistic to guess the language from a single word. Context always helps.

For example:

import NaturalLanguage

let recognizer = NLLanguageRecognizer()
recognizer.processString("Islam")
print(recognizer.dominantLanguage!.rawValue)  //Force unwrapping for brevity

This prints tr, which stands for Turkish. It's an educated guess.
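
If the search only ever has to tell a handful of languages apart, one way to give the recognizer some of the missing context is its languageConstraints property, which restricts the candidates it considers. The following is a minimal sketch; the helper name dominantLanguage(of:among:) and the choice of English, Bengali and Arabic are assumptions for illustration, taken from the test cases in the question:

```swift
import NaturalLanguage

// Hypothetical helper: guesses the language of `text`, considering only `candidates`.
func dominantLanguage(of text: String, among candidates: [NLLanguage]) -> NLLanguage? {
    let constrainedRecognizer = NLLanguageRecognizer()
    // Only these languages can be returned as the guess.
    constrainedRecognizer.languageConstraints = candidates
    constrainedRecognizer.processString(text)
    return constrainedRecognizer.dominantLanguage
}

// Assumption: the app only needs to distinguish these three languages.
let supported: [NLLanguage] = [.english, .bengali, .arabic]
let guess = dominantLanguage(of: "Islam", among: supported)
print(guess?.rawValue ?? "und")
```

A constraint narrows the guess but does not make it certain; a short Latin-script word can still be ambiguous among the languages that remain.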

If you want the other languages that were also possible, you could use languageHypotheses(withMaximum:):

let hypotheses = recognizer.languageHypotheses(withMaximum: 10)

for (lang, confidence) in hypotheses.sorted(by: { $0.value > $1.value }) {
    print(lang.rawValue, confidence)
}

This prints:

tr 0.2332388460636139 //Turkish
hr 0.1371040642261505 //Croatian
en 0.12280254065990448 //English
pt 0.08051242679357529
de 0.06824589520692825
nl 0.05405258387327194
nb 0.050924140959978104
it 0.037797268480062485
pl 0.03097432479262352
hu 0.0288708433508873

Now you could define an acceptable threshold of confidence in order to accept that result.
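
A threshold check along those lines could be sketched as follows. The helper names bestGuess(from:minimumConfidence:) and detectLanguage(of:minimumConfidence:) and the 0.5 cutoff are assumptions chosen for illustration, not part of the NaturalLanguage framework; pick a cutoff that suits your own data.

```swift
import NaturalLanguage

// Hypothetical helper: returns the most likely key only if its confidence
// reaches `minimumConfidence`; otherwise returns nil.
func bestGuess<Key: Hashable>(from hypotheses: [Key: Double],
                              minimumConfidence: Double) -> Key? {
    guard let best = hypotheses.max(by: { $0.value < $1.value }),
          best.value >= minimumConfidence else {
        return nil
    }
    return best.key
}

// Hypothetical wrapper: runs the recognizer and applies the threshold.
// The default of 0.5 is an arbitrary assumption.
func detectLanguage(of text: String, minimumConfidence: Double = 0.5) -> NLLanguage? {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(text)
    let hypotheses = recognizer.languageHypotheses(withMaximum: 5)
    return bestGuess(from: hypotheses, minimumConfidence: minimumConfidence)
}

if let language = detectLanguage(of: "Islam") {
    print("Detected \(language.rawValue)")
} else {
    print("Not confident enough to name a language")
}
```

With confidences like the ones listed above, where the top guess is only about 0.23, a 0.5 cutoff would reject the result, and the search could wait for more characters before deciding.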


Language codes can be found here
