
How to detect text (string) language in iOS?

For instance, given the following strings:

let textEN = "The quick brown fox jumps over the lazy dog"
let textES = "El zorro marrón rápido salta sobre el perro perezoso"
let textAR = "الثعلب البني السريع يقفز فوق الكلب الكسول"
let textDE = "Der schnelle braune Fuchs springt über den faulen Hund"

I want to detect the language used in each of them.

Let's assume the signature of the function to implement is:

func detectedLanguage<T: StringProtocol>(_ forString: T) -> String?

It returns an optional string, which is nil when no language is detected.

Thus the expected results would be:

let englishDetectedLanguage = detectedLanguage(textEN) // => English
let spanishDetectedLanguage = detectedLanguage(textES) // => Spanish
let arabicDetectedLanguage = detectedLanguage(textAR) // => Arabic
let germanDetectedLanguage = detectedLanguage(textDE) // => German

Is there an easy approach to achieve it?

Latest versions (iOS 12+)

Briefly:

You could achieve it by using NLLanguageRecognizer, as:

import NaturalLanguage

func detectedLanguage(for string: String) -> String? {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(string)
    guard let languageCode = recognizer.dominantLanguage?.rawValue else { return nil }
    let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode)
    return detectedLanguage
}

Older versions (iOS 11+)

Briefly:

You could achieve it by using NSLinguisticTagger, as:

import Foundation

func detectedLanguage<T: StringProtocol>(for string: T) -> String? {
    // NSLinguisticTagger.dominantLanguage(for:) returns a BCP-47 code such as "en" (iOS 11+).
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(string)) else { return nil }
    // Convert the code into a name localized for the current locale.
    let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode)
    return detectedLanguage
}

Details:

First of all, you should be aware that what you are asking about mainly relates to the world of natural language processing (NLP).

Since NLP covers far more than text language detection, the rest of the answer will not go into NLP-specific information.

Obviously, implementing such functionality is not that easy, especially once you start to care about the details of the process, such as splitting text into sentences and even into words, and after that recognising names, punctuation, etc. I bet you would think "What a painful process! It is not even logical to do it by myself". Fortunately, iOS does support NLP (actually, NLP APIs are available on all Apple platforms, not only iOS), which makes what you are aiming for easy to implement. The core component that you would work with is NSLinguisticTagger:

Analyze natural language text to tag part of speech and lexical class, identify names, perform lemmatization, and determine the language and script.

NSLinguisticTagger provides a uniform interface to a variety of natural language processing functionality with support for many different languages and scripts. You can use this class to segment natural language text into paragraphs, sentences, or words, and tag information about those segments, such as part of speech, lexical class, lemma, script, and language.

As mentioned in the class documentation, the method that you are looking for (under the "Determining the Dominant Language and Orthography" section) is dominantLanguage(for:):

Returns the dominant language for the specified string.

. .

. .

Return Value

The BCP-47 tag identifying the dominant language of the string, or the tag "und" if a specific language cannot be determined.

You might notice that NSLinguisticTagger has existed since iOS 5. However, the dominantLanguage(for:) method is only supported on iOS 11 and above, because it has been developed on top of the Core ML framework:

. . . . . .

Core ML is the foundation for domain-specific frameworks and functionality. Core ML supports Vision for image analysis, Foundation for natural language processing (for example, the NSLinguisticTagger class), and GameplayKit for evaluating learned decision trees. Core ML itself builds on top of low-level primitives like Accelerate and BNNS, as well as Metal Performance Shaders.


The value returned from calling dominantLanguage(for:) with "The quick brown fox jumps over the lazy dog":

NSLinguisticTagger.dominantLanguage(for: "The quick brown fox jumps over the lazy dog")

would be the optional string "en". However, that is not yet the desired output; the expectation is to get "English" instead! Well, that is exactly what you get by calling the localizedString(forIdentifier:) method of the Locale structure and passing it the obtained language code:

Locale.current.localizedString(forIdentifier: "en") // English
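The display name depends on the locale that performs the lookup, so the same code can be rendered in the user's language or in a fixed one. A minimal sketch of that idea, assuming a hypothetical helper named languageName(for:in:) that is not part of the original answer:

```swift
import Foundation

// Hypothetical helper (not from the original answer): turns a BCP-47
// language code into a display name using the given locale, which
// defaults to the user's current locale.
func languageName(for code: String, in locale: Locale = .current) -> String? {
    return locale.localizedString(forIdentifier: code)
}

// The same language code looked up through two fixed locales.
let inEnglish = languageName(for: "en", in: Locale(identifier: "en_US"))
let inSpanish = languageName(for: "en", in: Locale(identifier: "es_ES"))
print(inEnglish ?? "nil", inSpanish ?? "nil")
```

Passing a fixed locale can be useful when the name is stored or logged, while Locale.current is usually the better choice for text shown to the user.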

Putting it all together:

As in the brief snippet for iOS 11+ above, the function would be:

func detectedLanguage<T: StringProtocol>(_ forString: T) -> String? {
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(forString)) else {
        return nil
    }

    let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode)

    return detectedLanguage
}

Output:

It would be as expected:

let englishDetectedLanguage = detectedLanguage(textEN) // => English
let spanishDetectedLanguage = detectedLanguage(textES) // => Spanish
let arabicDetectedLanguage = detectedLanguage(textAR) // => Arabic
let germanDetectedLanguage = detectedLanguage(textDE) // => German

Note that:

There are still cases where no language name is obtained for a given string, like:

let textUND = "SdsOE"
let undefinedDetectedLanguage = detectedLanguage(textUND) // => Unknown language

Or it could even be nil:

let rubbish = "000747322"
let rubbishDetectedLanguage = detectedLanguage(rubbish) // => nil

Still, I find that a reasonable result for providing a useful output...
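If "Unknown language" is not something you want to show, one option is to treat the undetermined tag "und" the same way as a missing code. A minimal sketch, assuming a hypothetical helper named displayName(forDetectedCode:locale:) that is not part of the original answer:

```swift
import Foundation

// Hypothetical helper (not from the original answer): maps a detected
// language code to a display name, returning nil both when the code is
// missing and when it is the undetermined BCP-47 tag "und".
func displayName(forDetectedCode code: String?, locale: Locale = .current) -> String? {
    guard let code = code, code != "und" else { return nil }
    return locale.localizedString(forIdentifier: code)
}

// Usage with the tagger from the answer (Apple platforms only):
// let name = displayName(forDetectedCode: NSLinguisticTagger.dominantLanguage(for: "SdsOE"))
```

With this wrapper, callers only need to handle a single "no language" case.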


Furthermore:

About NSLinguisticTagger:

Although I am not going to dive deep into NSLinguisticTagger usage, I would like to note that it offers a couple of really cool features beyond simply detecting the language of a given text. As a pretty simple example: using the lemma when enumerating tags is very helpful when working with information retrieval, since you would be able to match the word "driving" when searching for the word "drive".
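The lemma idea can be sketched roughly as follows. This assumes iOS 11+ (for the unit-based enumeration API) and a hypothetical function named lemmas(in:); the fallback to the original token when no lemma is reported is a choice made for this sketch, not something the answer prescribes:

```swift
import Foundation

// Hypothetical helper (not from the original answer): returns one entry
// per word, using the lemma when the tagger reports one and the original
// token otherwise.
func lemmas(in text: String) -> [String] {
    let tagger = NSLinguisticTagger(tagSchemes: [.lemma], options: 0)
    tagger.string = text

    // NSLinguisticTagger works with NSRange, i.e. UTF-16 offsets.
    let range = NSRange(location: 0, length: text.utf16.count)
    let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace]

    var result: [String] = []
    tagger.enumerateTags(in: range, unit: .word, scheme: .lemma, options: options) { tag, tokenRange, _ in
        if let lemma = tag?.rawValue {
            result.append(lemma)
        } else {
            // No lemma available for this token: keep the word as written.
            result.append((text as NSString).substring(with: tokenRange))
        }
    }
    return result
}

print(lemmas(in: "I am driving"))
```

Indexing documents by lemma rather than by surface form is what lets a search for one inflection find the others.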

Official Resources

Apple Video Sessions:

Also, for getting familiar with Core ML:

You can use NSLinguisticTagger's tag(at:scheme:tokenRange:sentenceRange:) method. It supports iOS 5 and later.

import Foundation

func detectLanguage<T: StringProtocol>(for text: T) -> String? {
    let tagger = NSLinguisticTagger(tagSchemes: [.language], options: 0)
    tagger.string = String(text)

    // tag(at:...) returns an NSLinguisticTag?, so take its rawValue to get the language code.
    guard let languageCode = tagger.tag(at: 0, scheme: .language, tokenRange: nil, sentenceRange: nil)?.rawValue else { return nil }
    return Locale.current.localizedString(forIdentifier: languageCode)
}

detectLanguage(for: "The quick brown fox jumps over the lazy dog")              // English
detectLanguage(for: "El zorro marrón rápido salta sobre el perro perezoso")     // Spanish
detectLanguage(for: "الثعلب البني السريع يقفز فوق الكلب الكسول")                // Arabic
detectLanguage(for: "Der schnelle braune Fuchs springt über den faulen Hund")   // German

I tried NSLinguisticTagger with short input text like hello, and it always recognizes it as Italian. Luckily, Apple recently added NLLanguageRecognizer, available on iOS 12, and it seems to be more accurate :D

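import NaturalLanguage

let text = "hello"  // sample input; replace with the string to analyze

if #available(iOS 12.0, *) {
    let languageRecognizer = NLLanguageRecognizer()
    languageRecognizer.processString(text)
    // dominantLanguage is nil when no language can be determined, so avoid force-unwrapping it.
    if let code = languageRecognizer.dominantLanguage?.rawValue {
        let language = Locale.current.localizedString(forIdentifier: code)
        print(language ?? code)
    }
}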
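For short inputs like the one above, it can help to look at several candidates instead of a single dominant language. NLLanguageRecognizer exposes languageHypotheses(withMaximum:) for that. A minimal sketch, assuming a hypothetical helper named languageCandidates(for:limit:) that is not part of the original answer:

```swift
import NaturalLanguage

// Hypothetical helper (not from the original answer): returns up to
// `limit` candidate language codes with their estimated probabilities,
// most likely first.
@available(iOS 12.0, macOS 10.14, *)
func languageCandidates(for text: String, limit: Int = 3) -> [(language: String, confidence: Double)] {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(text)
    return recognizer.languageHypotheses(withMaximum: limit)
        .sorted { $0.value > $1.value }
        .map { (language: $0.key.rawValue, confidence: $0.value) }
}

if #available(iOS 12.0, macOS 10.14, *) {
    for candidate in languageCandidates(for: "hello") {
        print(candidate.language, candidate.confidence)
    }
}
```

A caller could, for example, only accept the top candidate when its confidence is above a threshold of its own choosing and fall back to nil otherwise.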

