Every programming language has its own interpretation of \n and \r. Unicode supports multiple characters that can represent a new line.
From the Rust reference:
A whitespace escape is one of the characters U+006E (n), U+0072 (r), or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or U+0009 (HT) respectively.
Based on that statement, I'd say a Rust character is a new-line character if it is either \n or \r. On Windows it might be the combination of \r and \n. I'm not sure though.
What about the following?
In my opinion, we are missing something like a char.is_new_line(). I looked through the Unicode Character Categories but couldn't find a definition for new-lines.
Do I have to come up with my own definition of what a Unicode new-line character is?
There is considerable practical disagreement between languages like Java, Python, Go and JavaScript as to what constitutes a newline character and how that translates to "new lines". The disagreement shows in how their batteries-included regex engines treat a pattern like $ against a string like \r\r\n\n in multi-line mode: are there two lines (\r\r\n, \n), three lines (\r, \r\n, \n, like Unicode says) or four (\r, \r, \n, \n, like JS sees it)? Go and Python do not treat \r\n as a single $, and neither does Rust's regex crate; Java's does, however. I don't know of any language whose batteries extend newline handling to any more Unicode characters.
So the takeaway here is: \n is a newline; \r\n may be a single newline, or it may be treated as two newlines, or as "some character followed by a newline", depending on the engine.

If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. Don't expect real-world input to rely on that. After all, we have had the ASCII Record Separator for a gazillion years and everybody uses \t instead anyway.
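Such a function could look roughly like the sketch below. The name is_newline and the choice of characters are assumptions, not anything from the standard library: the set is the one Unicode's line breaking algorithm (UAX #14) classifies as mandatory breaks, and you may well want a smaller one.

```rust
/// Hypothetical helper: returns true for the characters that UAX #14
/// assigns to the mandatory-break classes BK, CR, LF and NL.
fn is_newline(c: char) -> bool {
    matches!(
        c,
        '\u{000A}'   // LINE FEED (LF)
        | '\u{000B}' // LINE TABULATION (VT)
        | '\u{000C}' // FORM FEED (FF)
        | '\u{000D}' // CARRIAGE RETURN (CR)
        | '\u{0085}' // NEXT LINE (NEL)
        | '\u{2028}' // LINE SEPARATOR
        | '\u{2029}' // PARAGRAPH SEPARATOR
    )
}

fn main() {
    for c in ['\n', '\r', '\t', 'a', '\u{2028}'] {
        println!("{:?} -> {}", c, is_newline(c));
    }
}
```

Note that this only classifies single characters. Whether \r\n counts as one line break or two is a separate decision that has to be made wherever the input is split.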
Update: See http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules, section LB5, for why \r\r\n should be treated as two line breaks. You could read the whole page to get a grip on how your original question would have to be implemented. My guess is that by the point you reach "South East Asian: line breaks require morphological analysis" you'll close the tab :-)
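To make the CR/LF part of that rule concrete, here is a rough sketch restricted to \r and \n; it is not an implementation of the full algorithm. The function name split_keep_terminators is made up, and each returned line keeps its terminator so the grouping is visible.

```rust
/// Hypothetical helper: splits `input` into lines, keeping each line's
/// terminator. Only CR, LF and CR LF are handled; CR LF is one break.
fn split_keep_terminators(input: &str) -> Vec<&str> {
    // CR and LF are ASCII, so byte offsets right after them are always
    // valid UTF-8 boundaries for slicing.
    let bytes = input.as_bytes();
    let mut lines = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < bytes.len() {
        let end = match bytes[i] {
            // CR directly followed by LF: never break between the two.
            b'\r' if bytes.get(i + 1) == Some(&b'\n') => i + 2,
            // A lone CR or a lone LF ends the line right after itself.
            b'\r' | b'\n' => i + 1,
            _ => {
                i += 1;
                continue;
            }
        };
        lines.push(&input[start..end]);
        start = end;
        i = end;
    }
    // Trailing text without a terminator forms a final line.
    if start < input.len() {
        lines.push(&input[start..]);
    }
    lines
}

fn main() {
    // The first CR stands alone, the second pairs with the LF after it.
    println!("{:?}", split_keep_terminators("\r\r\n\n"));
}
```

With this reading, \r\r\n\n comes out as the three pieces \r, \r\n and \n, and \r\r\n alone as two.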