How do I check if a character is a Unicode new-line character (not only ASCII) in Rust?
Every programming language has its own interpretation of \n and \r. Unicode supports multiple characters that can represent a new line.
From the Rust reference:
A whitespace escape is one of the characters U+006E (n), U+0072 (r), or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or U+0009 (HT) respectively.
Based on that statement, I'd say a Rust character is a new-line character if it is either \n or \r. On Windows it might be the combination of \r and \n. I'm not sure though.
What about the following?
In my opinion, we are missing something like a char.is_new_line(). I looked through the Unicode Character Categories but couldn't find a definition for new-lines.
Do I have to come up with my own definition of what a Unicode new-line character is?
There is considerable practical disagreement between languages like Java, Python, Go and JavaScript as to what constitutes a newline character and how that translates to "new lines". The disagreement shows in how the batteries-included regex engines treat a pattern like $ against a string like \r\r\n\n in multi-line mode: are there two lines (\r\r\n, \n), three lines (\r, \r\n, \n, as Unicode says) or four (\r, \r, \n, \n, as JS sees it)? Go and Python do not treat \r\n as a single $, and neither does Rust's regex crate; Java's does, however. I don't know of any language whose batteries extend newline handling to any more Unicode characters.
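To make the two/three/four split concrete, here is a minimal sketch in plain Rust (standard library only, no regex crate). The `Mode` enum and the `split_keep_terminators` function are hypothetical names made up for this illustration; they only imitate the three interpretations described above and are not how any of the mentioned regex engines is actually implemented.

```rust
/// How the splitter decides where one line ends (illustrative only).
#[derive(Clone, Copy)]
enum Mode {
    /// Only `\n` ends a line; `\r` is an ordinary character.
    LfOnly,
    /// `\r\n` is one break; a lone `\r` or a lone `\n` is also a break.
    CrLfAware,
    /// Every `\r` and every `\n` is a break of its own.
    EachChar,
}

/// Splits `s` into lines, keeping each terminator attached to its line.
fn split_keep_terminators(s: &str, mode: Mode) -> Vec<&str> {
    let bytes = s.as_bytes();
    let mut lines = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < bytes.len() {
        // Length in bytes of the line break starting at `i` (0 = no break).
        let break_len = match (mode, bytes[i]) {
            (_, b'\n') => 1,
            (Mode::CrLfAware, b'\r') if bytes.get(i + 1) == Some(&b'\n') => 2,
            (Mode::CrLfAware, b'\r') | (Mode::EachChar, b'\r') => 1,
            _ => 0,
        };
        if break_len > 0 {
            // `\r` and `\n` are ASCII, so these offsets are char boundaries.
            let end = i + break_len;
            lines.push(&s[start..end]);
            start = end;
            i = end;
        } else {
            i += 1;
        }
    }
    if start < s.len() {
        // Trailing text without a terminator is still a line.
        lines.push(&s[start..]);
    }
    lines
}

fn main() {
    let input = "\r\r\n\n";
    for (name, mode) in [
        ("LF only", Mode::LfOnly),
        ("CRLF aware", Mode::CrLfAware),
        ("each char", Mode::EachChar),
    ] {
        let lines = split_keep_terminators(input, mode);
        println!("{name}: {} lines {:?}", lines.len(), lines);
    }
}
```

For the input \r\r\n\n the three modes yield two, three and four lines respectively, matching the readings listed above.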
So the takeaway here is:
- \n is a newline
- \r\n may be a single newline
- or \r\n is treated as two newlines
- or \r\n is "some character followed by a newline"
If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. Don't expect real-world input to rely on that, though. After all, we have had the ASCII Record Separator for a gazillion years, and everybody uses \t instead anyway.
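Such a function could look roughly like the sketch below. The name `is_unicode_newline` is hypothetical, and the choice of characters is an assumption: it uses the code points that Unicode lists as line terminators (LF, VT, FF, CR, NEL, LS, PS). Your own definition may well be narrower.

```rust
/// Returns true if `c` is one of the characters Unicode lists as a line
/// terminator. Which characters to include is an assumption of this sketch.
fn is_unicode_newline(c: char) -> bool {
    matches!(
        c,
        '\u{000A}'   // LF, line feed
        | '\u{000B}' // VT, vertical tab
        | '\u{000C}' // FF, form feed
        | '\u{000D}' // CR, carriage return
        | '\u{0085}' // NEL, next line
        | '\u{2028}' // LS, line separator
        | '\u{2029}' // PS, paragraph separator
    )
}

fn main() {
    let text = "first\u{2028}second\r\nthird";
    // Counts characters, not line breaks: `\r\n` contributes two here.
    let count = text.chars().filter(|&c| is_unicode_newline(c)).count();
    println!("newline characters found: {count}");
}
```

Note that this works on a single char, so it cannot tell whether \r\n is meant as one line break or two; that decision has to be made by whatever code walks over the string.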
Update: See http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules section LB5 for why \r\r\n should be treated as two line breaks. You could read the whole page to get a grip on how your original question would have to be implemented. My guess is that by the point you reach "South East Asian: line breaks require morphological analysis" you'll close the tab :-)