
How do I check if a character is a Unicode new-line character (not only ASCII) in Rust?

Original 2017-07-09 11:23:22 6 1 unicode/ rust/ newline/ carriage-return/ linefeed

Every programming language has its own interpretation of \n and \r. Unicode supports multiple characters that can represent a new line.

From the Rust reference:

A whitespace escape is one of the characters U+006E (n), U+0072 (r), or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or U+0009 (HT) respectively.

Based on that statement, I'd say a Rust character is a new-line character if it is either \n or \r. On Windows it might be the combination of \r and \n. I'm not sure, though.

What about the following?

Next line character (U+0085)
Line separator character (U+2028)
Paragraph separator character (U+2029)

In my opinion, we are missing something like a char.is_new_line(). I looked through the Unicode Character Categories but couldn't find a definition for new-lines.

Do I have to come up with my own definition of what a Unicode new-line character is?

1 answer

There is considerable practical disagreement between languages like Java, Python, Go and JavaScript as to what constitutes a newline character and how that translates to "new lines". The disagreement shows in how the batteries-included regex engines treat a pattern like $ against a string like \r\r\n\n in multi-line mode: are there two lines (\r\r\n, \n), three lines (\r, \r\n, \n, as Unicode says) or four (\r, \r, \n, \n, as JS sees it)? Go and Python do not treat \r\n as a single $, and neither does Rust's regex crate; Java's does, however. I don't know of any language whose batteries extend newline handling to any more Unicode characters.

So the takeaway here is:

It is agreed upon that \n is a newline
\r\n may be a single newline
unless \r\n is treated as two newlines
unless \r\n is "some character followed by a newline"
You shall not have any more newlines besides that.
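To make the disagreement concrete, here is a minimal Rust sketch that counts line breaks in a string under the three interpretations listed above. The function names are hypothetical and only illustrate the counting rules; they do not reproduce any particular regex engine.

```rust
/// Counts line breaks when only LF ends a line
/// (a CR is just "some character" in front of it).
fn count_lf_only(s: &str) -> usize {
    s.matches('\n').count()
}

/// Counts line breaks when CR LF is a single break
/// and a lone CR or a lone LF is a break as well.
fn count_crlf_as_one(s: &str) -> usize {
    let mut count = 0;
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Swallow the LF of a CR LF pair so it is not counted twice.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                count += 1;
            }
            '\n' => count += 1,
            _ => {}
        }
    }
    count
}

/// Counts line breaks when every CR and every LF is its own break.
fn count_each_char(s: &str) -> usize {
    s.chars().filter(|&c| c == '\r' || c == '\n').count()
}
```

For the string \r\r\n\n from above, these return 2, 3 and 4 respectively, which matches the two-, three- and four-line readings.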

If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. Don't expect real-world input to rely on that. After all, we have had the ASCII record separator for a gazillion years, and everybody uses \t instead anyway.
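Such a function could look like the sketch below. The name is_unicode_newline is hypothetical, and the character set is an assumption: it takes LF, CR and the three characters named in the question, plus VT and FF, which the Unicode newline guidelines also list. Trim the set to whatever your input actually needs.

```rust
/// Returns true if `c` is one of the characters this sketch treats as a newline.
/// The set is a choice made for illustration, not something Rust's std defines.
fn is_unicode_newline(c: char) -> bool {
    matches!(
        c,
        '\u{000A}'   // LF, line feed
        | '\u{000B}' // VT, line tabulation
        | '\u{000C}' // FF, form feed
        | '\u{000D}' // CR, carriage return
        | '\u{0085}' // NEL, next line
        | '\u{2028}' // LS, line separator
        | '\u{2029}' // PS, paragraph separator
    )
}
```

Note that this works on a single char, so it cannot tell a CR LF pair from two separate breaks; that decision has to be made by whatever code walks the string.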

Update: See section LB5 of http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules for why \r\r\n should be treated as two line breaks. You could read the whole page to get a grip on how your original question would have to be implemented. My guess is that by the point you reach "South East Asian: line breaks require morphological analysis" you'll close the tab :-)
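As a rough illustration of the mandatory-break part only, the sketch below splits a string so that CR LF counts as one break while a lone CR, LF, VT, FF, NEL, LS or PS each counts as one too. The function name split_mandatory_breaks is hypothetical, the character set is my reading of the hard line breaks in that report, and none of the other line-breaking rules are attempted.

```rust
/// Splits `s` at hard line breaks, treating CR LF as a single break.
/// Break characters are not included in the returned slices.
fn split_mandatory_breaks(s: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0; // byte offset where the current line begins
    let mut iter = s.char_indices().peekable();
    while let Some((i, c)) = iter.next() {
        let is_break = matches!(
            c,
            '\n' | '\r' | '\u{000B}' | '\u{000C}' | '\u{0085}' | '\u{2028}' | '\u{2029}'
        );
        if !is_break {
            continue;
        }
        lines.push(&s[start..i]);
        let mut end = i + c.len_utf8();
        if c == '\r' {
            // CR directly followed by LF is one break, not two.
            if let Some(&(_, '\n')) = iter.peek() {
                iter.next();
                end += 1;
            }
        }
        start = end;
    }
    // Keep a trailing line that has no terminator.
    if start < s.len() {
        lines.push(&s[start..]);
    }
    lines
}
```

With this reading, \r\r\n\n yields three empty lines, because the first CR is a break on its own and the following CR LF is a single break.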

Related questions:

How do I represent a Unicode character in a literal string ISO/ANSI C when the character set is ASCII?

vertical dotted line ascii or unicode character

How do I check if a string is unicode or ascii?

Replace ascii character with unicode

Weird ASCII/Unicode Character

Why are there blank spaces in the Unicode character table and how do I check if a unicode value is one of those?

How to convert a Unicode character to its ASCII equivalent

How do i use this unicode character properly?

How do I get rid of this unicode character?

A commented statement with a unicode new line character


Technical posts on this site are licensed under CC BY-SA 4.0. If you reprint one, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 13 ACCEPTED 2017-07-09 12:10:22
