
Detect Language of a String

I need to detect the language of a string read from a PDF file. The text is basically in English, but NLLanguageRecognizer reports it as "Romanian".

The function I am using is:

class func detectedLangaugeFormat(for string: String) -> String {
    if #available(iOS 12.0, *) {
        let recognizer = NLLanguageRecognizer()
        recognizer.processString(string)
        guard let languageCode = recognizer.dominantLanguage?.rawValue else { return "rtl" }
        let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)
        print("lan")
        let direction: NSLocale.LanguageDirection = NSLocale.characterDirection(forLanguage: languageCode)
        if direction == .rightToLeft {
            return "rtl"
        } else if direction == .leftToRight {
            return "ltr"
        }
    } else {
        // Fallback on earlier versions
    }
    return "rtl"
}

And the string given to this method is:

"\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n                   just for use in the Virtual Mechanics tutorials. More text. And more \r\n                   text. And more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. Boring, zzzzz. And more text. And more text. And \r\n                   more text. And more text. And more text. And more text. And more text. \r\n                   And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. And more text. Even more. Continued on page 2 ...\r\n                Simple PDF File 2 \r\n                   ...continued from page 1. Yet more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. Oh, how boring typing this stuff. But not as boring as watching \r\n                   paint dry. And more text. And more text. And more text. And more text. \r\n                   Boring.  More, a little more text. The end, and just as well. "

One possible solution is to collapse runs of multiple spaces in the string.

// `str` is the mutable string read from the PDF.
let regex = try? NSRegularExpression(pattern: "  +")
let fullRange = NSRange(str.startIndex..., in: str) // UTF-16 range; str.count would miscount
str = regex?.stringByReplacingMatches(in: str, options: [], range: fullRange, withTemplate: " ") ?? str

I tried your string with this regex and it worked. The language recognizer returned the en language code.
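The same idea can be pushed a little further by collapsing every run of whitespace, including the \r\n pairs and tabs that PDF extraction tends to leave behind, and trimming both ends. This is only a sketch: collapsingWhitespace is a hypothetical helper name, not part of any framework, and it assumes the original line breaks do not need to be preserved.

```swift
import Foundation

/// Hypothetical helper: collapses every run of whitespace (spaces, tabs, \r, \n)
/// into a single space, then trims both ends.
func collapsingWhitespace(_ text: String) -> String {
    let collapsed = text.replacingOccurrences(
        of: "\\s+",
        with: " ",
        options: .regularExpression
    )
    return collapsed.trimmingCharacters(in: .whitespacesAndNewlines)
}

let raw = "\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n"
let cleaned = collapsingWhitespace(raw)
print(cleaned) // A Simple PDF File This is a small demonstration .pdf file -
```

The cleaned string can then be passed to detectedLangaugeFormat(for:) in place of the raw one.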

For some reason, whitespace and newlines make the result of processString(_:) inaccurate. What you should do is trim them from the string before passing it to your method:

let givenString = "\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n                   just for use in the Virtual Mechanics tutorials. More text. And more \r\n                   text. And more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. Boring, zzzzz. And more text. And more text. And \r\n                   more text. And more text. And more text. And more text. And more text. \r\n                   And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. And more text. Even more. Continued on page 2 ...\r\n                Simple PDF File 2 \r\n                   ...continued from page 1. Yet more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. Oh, how boring typing this stuff. But not as boring as watching \r\n                   paint dry. And more text. And more text. And more text. And more text. \r\n                   Boring.  More, a little more text. The end, and just as well. "
let trimmedString = givenString.trimmingCharacters(in: .whitespacesAndNewlines)

let result = detectedLangaugeFormat(for: trimmedString)
print(result) // ltr

At this point, it should be recognized as English (if you print detectedLangauge inside your method instead of "lan", you'll see "English").

let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)
print(detectedLangauge) // Optional("English")
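To see why the recognizer picked one language over another, it can help to look at its ranked guesses rather than only dominantLanguage. The sketch below uses languageHypotheses(withMaximum:) from NLLanguageRecognizer; the helper name rankedLanguageHypotheses and the limit of 3 are assumptions made for illustration, and it only builds on Apple platforms where the NaturalLanguage framework is available.

```swift
import Foundation
import NaturalLanguage

/// Hypothetical helper: returns up to `limit` candidate languages for `text`,
/// most probable first, as (language code, probability) pairs.
@available(iOS 12.0, macOS 10.14, *)
func rankedLanguageHypotheses(for text: String, limit: Int = 3) -> [(code: String, probability: Double)] {
    let recognizer = NLLanguageRecognizer()
    // Trim leading and trailing whitespace before processing the text.
    recognizer.processString(text.trimmingCharacters(in: .whitespacesAndNewlines))
    return recognizer.languageHypotheses(withMaximum: limit)
        .sorted { $0.value > $1.value }
        .map { (code: $0.key.rawValue, probability: $0.value) }
}

let sample = "\r\n        A Simple PDF File \r\n     This is a small demonstration .pdf file - \r\n"
for hypothesis in rankedLanguageHypotheses(for: sample) {
    // Prints one candidate per line, e.g. a language code followed by its probability.
    print(hypothesis.code, hypothesis.probability)
}
```

If the expected languages are known in advance, setting the recognizer's languageConstraints property before calling processString(_:) is another option worth trying.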

Strip the leading non-alphanumeric characters (whitespace, !, @, #, etc.) from the string, then try to detect the language.

import Foundation

extension String {
    /// Index of the first ASCII letter or digit, or nil if there is none.
    func findFirstAlphabetic() -> String.Index? {
        return indices.first(where: { String(self[$0]).isAlphanumeric })
    }

    /// True when the string is non-empty and contains only ASCII letters and digits.
    var isAlphanumeric: Bool {
        return !isEmpty && range(of: "[^a-zA-Z0-9]", options: .regularExpression) == nil
    }

    /// The substring starting at the first ASCII letter or digit, or nil if there is none.
    func alphabetic_Leading_SubString() -> String? {
        guard let startIndex = findFirstAlphabetic() else { return nil }
        return String(self[startIndex...])
    }
}

Usage:

let string = "\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n                   just for use in the Virtual Mechanics tutorials. More text. And more \r\n                   text. And more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. Boring, zzzzz. And more text. And more text. And \r\n                   more text. And more text. And more text. And more text. And more text. \r\n                   And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. And more text. Even more. Continued on page 2 ...\r\n                Simple PDF File 2 \r\n                   ...continued from page 1. Yet more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. Oh, how boring typing this stuff. But not as boring as watching \r\n                   paint dry. And more text. And more text. And more text. And more text. \r\n                   Boring.  More, a little more text. The end, and just as well. "
detectedLangaugeFormat(for: string.alphabetic_Leading_SubString()!)
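A shorter variant of the same idea, offered only as a sketch: Foundation's trimmingCharacters(in:) with the inverted CharacterSet.alphanumerics set strips non-alphanumeric characters from both ends of the string and is Unicode-aware, so it also handles text that does not start with an ASCII letter. The helper name trimmingNonAlphanumerics is hypothetical. Characters in the middle of the string are left untouched.

```swift
import Foundation

/// Hypothetical helper: strips leading and trailing characters that are
/// neither letters nor digits. Interior characters are not modified.
func trimmingNonAlphanumerics(_ text: String) -> String {
    return text.trimmingCharacters(in: CharacterSet.alphanumerics.inverted)
}

let padded = "\r\n                A Simple PDF File. \r\n"
print(trimmingNonAlphanumerics(padded)) // A Simple PDF File
```

Unlike alphabetic_Leading_SubString(), this returns a non-optional String (empty when nothing alphanumeric is found), so the force unwrap at the call site is not needed.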
