简体繁体 English

如何在Rust中检查字符是否是Unicode换行符（不仅是ASCII）？

[英]How do I check if a character is a Unicode new-line character (not only ASCII) in Rust?

原文 2017-07-09 11:23:22 0 1 unicode/ rust/ newline/ carriage-return/ linefeed

Every programming language has their own interpretation of \\n and \\r . 每种编程语言都有自己对\\n和\\r \\n的解释。 Unicode supports multiple characters that can represent a new line. Unicode支持可以表示新行的多个字符。

From the Rust reference : 从Rust参考：

A whitespace escape is one of the characters U+006E (n), U+0072 (r), or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or U+0009 (HT) respectively. 空白转义是字符U + 006E（n），U + 0072（r）或U + 0074（t）之一，表示Unicode值U + 000A（LF），U + 000D（CR）或U + 0009（HT）。

Based on that statement, I'd say a Rust character is a new-line character if it is either \\n or \\r . 基于该语句，我会说如果是\\n或\\r ，则Rust字符是换行符。 On Windows it might be the combination of \\r and \\n . 在Windows上，它可能是\\r和\\n的组合。 I'm not sure though. 我不确定。

What about the following? 以下怎么样？

Next line character (U+0085) 下一行字符（U + 0085）
Line separator character (U+2028) 行分隔符（U + 2028）
Paragraph separator character (U+2029) 段落分隔符（U + 2029）

In my opinion, we are missing something like a char.is_new_line() . 在我看来，我们缺少像char.is_new_line()这样的东西。 I looked through the Unicode Character Categories but couldn't find a definition for new-lines. 我查看了Unicode字符类别，但找不到新行的定义。

Do I have to come up with my own definition of what a Unicode new-line character is? 我是否必须提出自己对Unicode换行符的定义？

1 个解决方案

There is considerable practical disagreement between languages like Java, Python, Go and JavaScript as to what constitutes a newline-character and how that translates to "new lines". Java，Python，Go和JavaScript等语言之间存在相当大的实际分歧，即构成换行符的内容以及转换为“新行”的方式。 The disagreement is demonstrated by how the batteries-included regex engines treat patterns like $ against a string like \\r\\r\\n\\n in multi-line-mode: Are there two lines ( \\r\\r\\n , \\n ), three lines ( \\r , \\r\\n , \\n , like Unicode says) or four ( \\r , \\r , \\n , \\n , like JS sees it)? 包含电池的正则表达式引擎如何在多行模式下对像\\r\\r\\n\\n字符串这样的字符串处理类似$模式表明了分歧：是否有两行（ \\r\\r\\n ， \\n ），三行（ \\r ， \\r\\n ， \\n ，像Unicode说的那样）或四行（ \\r ， \\r ， \\n ， \\n ，就像JS看到的那样）？ Go and Python do not treat \\r\\n as a single $ and neither does Rust's regex crate; Go和Python不会将\\r\\n视为单个$ ，Rust的正则表达式也不会; Java's does however. 然而，Java确实如此。 I don't know of any language whose batteries extend newline-handling to any more Unicode characters. 我不知道任何语言的电池将换行处理扩展到任何更多的Unicode字符。

So the takeaway here is 所以这里的内容是

It is agreed upon that \\n is a newline 同意\\n是换行符
\\r\\n may be a single newline \\r\\n可能是一个换行符
unless \\r\\n is treated as two newlines 除非\\r\\n被视为两个换行符
unless \\r\\n is "some character followed by a newline" 除非\\r\\n是“某个字符后跟换行符”
You shall not have any more newlines beside that. 除此之外你不会再有任何换行符了。

If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. 如果您确实需要将更多Unicode字符视为换行符，则必须定义一个为您执行此操作的函数。 Don't expect real-world input that expects that. 不要指望期望的真实世界输入。 After all, we had the ASCII Record separator for a gazillion years and everybody uses \\t instead as well. 毕竟，我们有很多年的ASCII记录分隔符，而且每个人都使用\\t来代替。

Update: See http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules section LB5 for why \\r\\r\\n should be treated as two line breaks. 更新：请参阅http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules部分LB5了解为什么\\r\\r\\n应被视为两个换行符。 You could read the whole page to get a grip on how your original question would have to be implemented. 您可以阅读整页以了解原始问题的实施方式。 My guess is by the point you reach " South East Asian: line breaks require morphological analysis " you'll close the tab :-) 我的猜测是你到达“ 东南亚：换行需要形态分析 ”你将关闭标签:-)