
Change non-Latin characters to Latin characters in a string

I'm trying to match, with a regex in Ruby or JavaScript, a string that contains non-English characters.

So is there a way to replace the string "täglichen" with the string "taglichen"? I know that I can match non-English characters by listing the alternatives in a character class, like:

/(?i)t[aä]glichen/

But for this I need a dictionary of all possible characters, and I have to put every one of them into the searched word. Maybe there is a more efficient way to do this?

There is a legit solution for modern Ruby (2.2+), using String#unicode_normalize:

# Decompose "ä" into "a" + combining diaeresis, then drop every non-ASCII codepoint.
"täglichen".unicode_normalize(:nfd).
            codepoints.
            reject { |cp| cp > 127 }.
            pack('U*')
#⇒ "taglichen"
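Dropping every non-ASCII codepoint also deletes characters that have no decomposition (for example "ß"). A slightly more targeted variant, sketched below, removes only the combining marks that NFD exposes. The helper name strip_diacritics is made up for this illustration; it is not part of Ruby.

```ruby
# Hypothetical helper: remove diacritics but keep every other character.
def strip_diacritics(str)
  # NFD splits "ä" into "a" followed by U+0308 (combining diaeresis);
  # \p{Mn} matches such non-spacing combining marks.
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts strip_diacritics("t\u00E4glichen")  # precomposed input
puts strip_diacritics("ta\u0308glichen") # already decomposed input
```

Both calls should print the same plain-ASCII word, because the input is normalized before the marks are removed.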

To match:

"täglichen".unicode_normalize(:nfc) =~ /t[aä]glichen/i
#⇒ 0
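The pattern above still lists the accented variant by hand. One way to avoid the dictionary the question mentions is to fold both the search word and the text to their unaccented form before comparing. This is only a sketch under that assumption; fold_diacritics and accent_insensitive_match? are hypothetical names, not built-in methods.

```ruby
# Hypothetical helper: decompose, then drop combining marks.
def fold_diacritics(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Hypothetical helper: true if `word` occurs in `text`,
# ignoring case and diacritics on both sides.
def accent_insensitive_match?(text, word)
  pattern = Regexp.new(Regexp.escape(fold_diacritics(word)), Regexp::IGNORECASE)
  pattern.match?(fold_diacritics(text))
end

puts accent_insensitive_match?("Die t\u00E4glichen Nachrichten", "taglichen")
puts accent_insensitive_match?("Die monatlichen Nachrichten", "t\u00E4glichen")
```

The trade-off is that folding treats "ä" and "a" as the same letter everywhere, which may be too loose for some languages.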

The normalization is needed because the umlaut may be encoded either as a single codepoint (228) or as a base letter followed by a combining diaeresis ([97, 776]). Check this (try copy-pasting it into your REPL):

# precomposed "ä" vs. "a" + combining diaeresis; both render identically
"\u00E4" == "a\u0308"
#⇒ false
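To make the difference visible, the two encodings can be written with explicit escapes and inspected. The variable names below are chosen for this illustration only.

```ruby
precomposed = "\u00E4"  # single codepoint: LATIN SMALL LETTER A WITH DIAERESIS
decomposed  = "a\u0308" # "a" followed by COMBINING DIAERESIS

p precomposed.codepoints                              # [228]
p decomposed.codepoints                               # [97, 776]
p precomposed == decomposed                           # false
p precomposed.unicode_normalize(:nfd) == decomposed   # true
p decomposed.unicode_normalize(:nfc) == precomposed   # true
```

Normalizing both sides to the same form (either :nfc or :nfd) before comparing or matching removes the ambiguity.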

One thing you can do is slugify your strings before matching (https://www.npmjs.com/package/slugify):

Input: "Ich heiße Fred"
Output: "ich-heisse-fred" (with the lower option enabled)

If you don't like the - characters as separators, you can change that, as stated in the docs.
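The slugify package is a JavaScript library. For the Ruby side of the question, a rough hand-rolled approximation of the same idea could look like the sketch below. It is not the npm package's algorithm: simple_slug is a hypothetical helper, and "ß" is mapped explicitly because it has no Unicode decomposition.

```ruby
# Hypothetical helper: lowercase, fold diacritics, join words with a separator.
def simple_slug(str, separator: "-")
  folded = str.downcase
              .gsub("\u00DF", "ss")        # "ß" has no NFD decomposition
              .unicode_normalize(:nfd)
              .gsub(/\p{Mn}/, "")          # drop combining marks
  # Keep only runs of ASCII letters and digits, joined by the separator.
  folded.scan(/[a-z0-9]+/).join(separator)
end

puts simple_slug("Ich hei\u00DFe Fred")
puts simple_slug("Ich hei\u00DFe Fred", separator: "_")
```

Slugging both the search term and the candidate string before comparing them gives a match that ignores case, diacritics, and punctuation.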
