How to check if a string contains accented Latin characters like é in Ruby?

Given:

str1 = "é"   # Latin accent
str2 = "囧"  # Chinese character
str3 = "ジ"  # Japanese character
str4 = "e"   # English character

How can I differentiate str1 (an accented Latin character) from the rest of the strings?

Update:

Given:

str1 = "\xE9" # Latin accent é actually stored as \xE9 reading from a file

How would the answer be different?

I would first strip out all plain ASCII characters with gsub, and then check with a regex whether any Latin characters remain. Anything that survives the strip and still matches \p{Latin} is a non-ASCII Latin character, so this should detect the accented Latin characters.

def latin_accented?(str)
  str.gsub(/\p{Ascii}/, "") =~ /\p{Latin}/
end

latin_accented?("é")  #=> 0 (truthy)
latin_accented?("囧") #=> nil (falsy)
latin_accented?("ジ") #=> nil (falsy)
latin_accented?("e")  #=> nil (falsy)
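
For the updated case in the question, where é arrives as the single byte \xE9, the string is most likely Latin-1 data that Ruby has tagged as UTF-8, and gsub would raise on the invalid byte sequence. A minimal sketch, assuming the file really is ISO-8859-1 (the helper name latin_accented_bytes? is made up for illustration):

```ruby
# Hypothetical helper: assumes the raw bytes are ISO-8859-1 (Latin-1).
def latin_accented_bytes?(raw)
  # Re-tag the bytes as Latin-1, then transcode to UTF-8 so \p{...} applies.
  utf8 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
  # Drop plain ASCII, then look for any remaining Latin-script character.
  utf8.gsub(/\p{ASCII}/, "").match?(/\p{Latin}/)
end

latin_accented_bytes?("\xE9") #=> true
latin_accented_bytes?("e")    #=> false
```

If the file is in some other encoding, the force_encoding argument would have to change accordingly; reading the file with an explicit external encoding is another option.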

Try to use /\p{Latin}/.match(strX) or /[\p{Latin}&&[^a-zA-Z]]/ (if you want to detect only special Latin characters). Note that the && intersection has to sit inside a character class; outside of one it is matched as a literal "&&".
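
In Ruby the && intersection only takes effect inside a character class, so a sketch of the second suggestion could look like this (the method name special_latin? is an assumption, not part of the answer):

```ruby
# Hypothetical helper: true if the string contains a Latin-script
# character other than the unaccented ASCII letters a-z / A-Z.
def special_latin?(str)
  str.match?(/[\p{Latin}&&[^a-zA-Z]]/)
end

special_latin?("é")  #=> true
special_latin?("囧") #=> false
special_latin?("ジ") #=> false
special_latin?("e")  #=> false
```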

By the way, "e" (str4) is also a Latin character.

Hope it helps.

I'd use a two-stage approach:

  1. Rule out strings containing non-Latin characters by attempting to encode the string as Latin-1 (ISO-8859-1).
  2. Test for accented characters with a regular expression.

Example:

def is_accented_latin?(test_string)
  test_string.encode("ISO-8859-1")   # just to see if it raises an exception

  test_string.match(/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿ]/)
rescue Encoding::UndefinedConversionError
  false
end

I strongly suggest you choose for yourself the accented characters you want to screen for, rather than just copying what I've written; I may well have missed some. Also note that this will always return false for strings containing non-Latin characters, even if the string also contains a Latin character with an accent.
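
If maintaining the character list by hand is a concern, one possible alternative (a different technique from the answer above, sketched here under the assumption that the input is valid UTF-8) is to decompose the string with Unicode normalization and look for a Latin letter followed by a combining mark. The method name latin_with_diacritic? is hypothetical:

```ruby
# Hypothetical alternative: after canonical decomposition (NFD), an
# accented letter becomes a base letter plus a combining mark (\p{Mn}).
def latin_with_diacritic?(str)
  str.unicode_normalize(:nfd).match?(/\p{Latin}\p{Mn}/)
end

latin_with_diacritic?("é")   #=> true
latin_with_diacritic?("囧é") #=> true
latin_with_diacritic?("ジ")  #=> false
latin_with_diacritic?("e")   #=> false
```

Unlike the Latin-1 approach, this returns true for mixed strings, and it does not catch letters such as ß, æ or ø, which have no decomposition into a base letter plus a mark.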
