
How do I check if a character is a Unicode new-line character (not only ASCII) in Rust?

Every programming language has its own interpretation of \n and \r. Unicode supports multiple characters that can represent a new line.

From the Rust reference:

A whitespace escape is one of the characters U+006E (n), U+0072 (r), or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or U+0009 (HT) respectively.

Based on that statement, I'd say a Rust character is a new-line character if it is either \n or \r. On Windows it might be the combination of \r and \n. I'm not sure though.

What about the following?

  • Next line character (U+0085)
  • Line separator character (U+2028)
  • Paragraph separator character (U+2029)

In my opinion, we are missing something like a char.is_new_line(). I looked through the Unicode Character Categories but couldn't find a definition for new-lines.

Do I have to come up with my own definition of what a Unicode new-line character is?

There is considerable practical disagreement between languages like Java, Python, Go and JavaScript as to what constitutes a newline character and how that translates to "new lines". The disagreement is demonstrated by how the batteries-included regex engines treat patterns like $ against a string like \r\r\n\n in multi-line mode: are there two lines (\r\r\n, \n), three lines (\r, \r\n, \n, as Unicode says) or four (\r, \r, \n, \n, as JS sees it)? Go and Python do not treat \r\n as a single $, and neither does Rust's regex crate; Java's does, however. I don't know of any language whose batteries extend newline handling to any more Unicode characters.

So the takeaway here is:

  • It is agreed upon that \n is a newline
  • \r\n may be a single newline
  • unless \r\n is treated as two newlines
  • unless \r\n is "some character followed by a newline"
  • You shall not have any more newlines beside that.
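To make the two/three/four disagreement concrete, here is a minimal sketch that counts line terminators in a string under each reading. The function names are hypothetical and chosen only for this illustration; the first one simply wraps the standard library's str::lines, which splits on \n and strips one optional \r in front of it.

```rust
/// Counts lines the way the standard library's `str::lines` sees them:
/// only '\n' ends a line, and a '\r' directly before it is stripped.
fn count_std_lines(s: &str) -> usize {
    s.lines().count()
}

/// Counts line terminators, treating "\r\n" as one break and a lone
/// '\r' or '\n' as one break each.
fn count_crlf_as_one(s: &str) -> usize {
    let mut count = 0;
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Swallow the '\n' of a "\r\n" pair so it is not counted twice.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                count += 1;
            }
            '\n' => count += 1,
            _ => {}
        }
    }
    count
}

/// Counts line terminators, treating every '\r' and every '\n' as its
/// own break.
fn count_each_char(s: &str) -> usize {
    s.chars().filter(|&c| c == '\r' || c == '\n').count()
}
```

For the input "\r\r\n\n", count_std_lines returns 2, count_crlf_as_one returns 3 and count_each_char returns 4, which mirrors the three interpretations listed above.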

If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. Don't expect real-world input to rely on it. After all, we have had the ASCII record separator for a gazillion years, and everybody uses \t instead anyway.
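Such a function could look roughly like the sketch below. The name is_unicode_newline is hypothetical, and the set of characters is an assumption taken from the question (LF, CR, next line, line separator, paragraph separator); it is not a definition provided by the standard library. Depending on the definition you follow, vertical tab (U+000B) and form feed (U+000C) may belong in the set as well.

```rust
/// Hypothetical helper: returns true for the characters this sketch
/// chooses to treat as newlines. The set is an assumption, not a
/// definition from the Rust standard library.
fn is_unicode_newline(c: char) -> bool {
    matches!(
        c,
        '\n'            // U+000A LINE FEED
        | '\r'          // U+000D CARRIAGE RETURN
        | '\u{0085}'    // U+0085 NEXT LINE
        | '\u{2028}'    // U+2028 LINE SEPARATOR
        | '\u{2029}'    // U+2029 PARAGRAPH SEPARATOR
    )
}
```

Note that a per-character predicate like this cannot tell you whether \r\n is one break or two; that decision has to be made by whatever code walks the string.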

Update: See http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules, section LB5, for why \r\r\n should be treated as two line breaks. You could read the whole page to get a grip on how your original question would have to be implemented. My guess is that by the point you reach "South East Asian: line breaks require morphological analysis" you'll close the tab :-)
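As a rough illustration of the CR LF handling only, the sketch below splits a string at LF, CR, CR LF, U+0085, U+2028 and U+2029, treating CR LF as a single break. The name split_unicode_lines is hypothetical, and this is nowhere near a full implementation of the line breaking rules on that page; it covers hard line terminators and nothing else.

```rust
/// Hypothetical helper: splits `s` at LF, CR, CR LF, U+0085, U+2028 and
/// U+2029, treating CR LF as a single break. Terminators are not included
/// in the returned slices.
fn split_unicode_lines(s: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0; // byte offset where the current line begins
    let mut iter = s.char_indices().peekable();
    while let Some((i, c)) = iter.next() {
        let is_break = matches!(c, '\n' | '\r' | '\u{0085}' | '\u{2028}' | '\u{2029}');
        if !is_break {
            continue;
        }
        lines.push(&s[start..i]);
        let mut end = i + c.len_utf8();
        // Treat CR LF as one break by consuming the LF as well.
        if c == '\r' {
            if let Some(&(j, '\n')) = iter.peek() {
                iter.next();
                end = j + 1;
            }
        }
        start = end;
    }
    // Text after the last terminator forms a final line, if non-empty.
    if start < s.len() {
        lines.push(&s[start..]);
    }
    lines
}
```

With this sketch, "\r\r\n" yields two (empty) lines, and "a\r\nb\u{2028}c" yields the three lines "a", "b" and "c".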
