How do I check if a character is a Unicode new-line character (not only ASCII) in Rust?

Every programming language has its own interpretation of \n and \r. Unicode defines several characters that can represent a new line.

From the Rust reference:

A whitespace escape is one of the characters U+006E (n), U+0072 (r), or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or U+0009 (HT) respectively.

Based on that statement, I'd say a Rust character is a new-line character if it is either \n or \r. On Windows it might be the combination of \r and \n. I'm not sure, though.

What about the following?

  • Next line character (U+0085)
  • Line separator character (U+2028)
  • Paragraph separator character (U+2029)

In my opinion, we are missing something like a char.is_new_line(). I looked through the Unicode Character Categories but couldn't find a definition for new-lines.

Do I have to come up with my own definition of what a Unicode new-line character is?

There is considerable practical disagreement between languages like Java, Python, Go and JavaScript as to what constitutes a newline character and how that translates to "new lines". The disagreement shows in how the batteries-included regex engines treat a pattern like $ against a string like \r\r\n\n in multi-line mode: are there two lines (\r\r\n, \n), three lines (\r, \r\n, \n, as Unicode says) or four (\r, \r, \n, \n, as JS sees it)? Go and Python do not treat \r\n as a single line ending for $, and neither does Rust's regex crate; Java's does, however. I don't know of any language whose batteries extend newline handling to any more Unicode characters.

So the takeaway here is:

  • It is agreed upon that \n is a newline,
  • \r\n may be a single newline,
  • unless \r\n is treated as two newlines,
  • unless \r\n is "some character followed by a newline".
  • You shall not have any more newlines besides that.
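To make the three readings concrete, here is a minimal sketch that counts line breaks under each convention. The function names are made up for this illustration and do not come from any library.

```rust
// Three conventions for counting line breaks. All names are hypothetical.

/// Only `\n` ends a line; `\r` is treated as an ordinary character.
fn count_lf_only(s: &str) -> usize {
    s.chars().filter(|&c| c == '\n').count()
}

/// `\r\n` counts as one break; a lone `\r` or a lone `\n` also counts as one.
fn count_crlf_as_one(s: &str) -> usize {
    let mut count = 0;
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Swallow the `\n` of a `\r\n` pair so it is not counted twice.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                count += 1;
            }
            '\n' => count += 1,
            _ => {}
        }
    }
    count
}

/// Every `\r` and every `\n` is its own break.
fn count_each_char(s: &str) -> usize {
    s.chars().filter(|&c| c == '\r' || c == '\n').count()
}
```

For the string "\r\r\n\n" these return 2, 3 and 4 respectively. The standard library's str::lines() sides with the first reading and yields two lines for that input, because it splits on \n and only strips a \r that directly precedes it.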

If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. Don't expect real-world input to rely on them. After all, we have had the ASCII record separator for a gazillion years, and everybody uses \t instead anyway.
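Such a function could look roughly like the sketch below. The name is_unicode_newline is hypothetical, and the set of characters is an assumption: it takes the ones UAX #14 puts in its mandatory-break classes (BK, CR, LF, NL), which covers the three characters from the question plus vertical tab and form feed. Trim the list if your definition is narrower.

```rust
/// Hypothetical helper, not part of std: returns true for characters that
/// UAX #14 classifies as forcing a line break (assumed set, adjust as needed).
fn is_unicode_newline(c: char) -> bool {
    matches!(
        c,
        '\u{000A}'   // LF, line feed
        | '\u{000B}' // VT, vertical tab
        | '\u{000C}' // FF, form feed
        | '\u{000D}' // CR, carriage return
        | '\u{0085}' // NEL, next line
        | '\u{2028}' // LS, line separator
        | '\u{2029}' // PS, paragraph separator
    )
}
```

Note that this only answers the per-character question. It says nothing about \r\n being one break or two, which has to be decided by whatever walks over the string.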

Update: See http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules section LB5 for why \r\r\n should be treated as two line breaks. You could read the whole page to get a grip on how your original question would have to be implemented. My guess is that by the point you reach "South East Asian: line breaks require morphological analysis" you'll close the tab :-)
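If all you want is the hard line breaks and none of the rest of UAX #14, a splitter in the spirit of LB5 might be sketched as follows. Both function names are hypothetical, the break set is the same assumed one (LF, VT, FF, CR, NEL, LS, PS), and the only special case is that \r immediately followed by \n counts as a single break.

```rust
/// Assumed set of hard line-break characters (hypothetical helper).
fn is_break(c: char) -> bool {
    matches!(
        c,
        '\n' | '\u{000B}' | '\u{000C}' | '\r' | '\u{0085}' | '\u{2028}' | '\u{2029}'
    )
}

/// Splits `s` at hard breaks, treating `\r\n` as one break.
/// A break at the very end of the input does not produce a trailing empty line.
fn split_lines(s: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0; // byte offset where the current line begins
    let mut iter = s.char_indices().peekable();
    while let Some((i, c)) = iter.next() {
        if !is_break(c) {
            continue;
        }
        lines.push(&s[start..i]);
        let mut end = i + c.len_utf8();
        // A `\n` directly after `\r` belongs to the same break.
        if c == '\r' && matches!(iter.peek(), Some(&(_, '\n'))) {
            iter.next();
            end += 1; // `\n` is one byte in UTF-8
        }
        start = end;
    }
    if start < s.len() {
        lines.push(&s[start..]);
    }
    lines
}
```

With this, "\r\r\n\n" splits into three (empty) lines, and "a\r\r\nb\n" gives "a", "" and "b".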
