sort -o appends newline to end of file - why?

I'm working on a small text file containing a list of words. I want to add a new word to it and then sort it. The file doesn't have a newline at the end when I start, but does after the sort. Why? Can I avoid this behavior, or is there a way to strip the newline back out?

Example:

words.txt looks like

apple
cookie
salmon

I then run printf "\norange" >> words.txt; sort words.txt -o words.txt

I use printf rather than echo, figuring that will avoid the newline, but the file then reads

apple
cookie
orange
salmon
#newline here

If I just run printf "\norange" >> words.txt, orange appears at the bottom of the file with no newline, i.e.:

apple
cookie
salmon
orange
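The last byte of the file can be inspected before and after the sort to confirm what is described above. The following is only a minimal sketch: the scratch directory and the variable names are illustrative and not part of the original post, and sort is called with -o placed before the operand, which is the portable spelling.

```shell
# Scratch directory so no real file is touched (illustrative setup).
dir=$(mktemp -d)
f="$dir/words.txt"

# Three words, no trailing newline, as in the question.
printf 'apple\ncookie\nsalmon' > "$f"

# Append the new word with a leading newline.
printf '\norange' >> "$f"
before=$(tail -c 1 "$f" | od -An -c | tr -d ' ')

# Sort in place; -o may name one of the input files.
sort -o "$f" "$f"
after=$(tail -c 1 "$f" | od -An -c | tr -d ' ')

echo "last byte before sort: $before"
echo "last byte after sort:  $after"
```

The first line should report the letter e (the end of "orange") and the second should report \n, the newline that sort added.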

This behavior is explicitly defined in the POSIX specification for sort:

The input files shall be text files, except that the sort utility shall add a newline to the end of a file ending with an incomplete last line.

This is because a UNIX "text file" is only valid if all of its lines end in newlines, as also defined in the POSIX standard:

Text file - A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the newline character. Although POSIX.1-2008 does not distinguish between text files and binary files (see the ISO C standard), many utilities only produce predictable or meaningful output when operating on text files. The standard utilities that have such restrictions always specify "text files" in their STDIN or INPUT FILES sections.
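Whether a given file satisfies the "every line ends in a newline" part of that definition can be checked from the shell. A small sketch, assuming a helper named ends_with_newline (a hypothetical name, not a standard utility) and assuming the file contains no NUL bytes:

```shell
# Succeeds if the file is empty or its last byte is a newline.
# ends_with_newline is an illustrative helper, not a standard command.
ends_with_newline() {
    # Command substitution strips trailing newlines, so a final
    # newline byte leaves an empty string here.
    [ -z "$(tail -c 1 "$1")" ]
}

# Example usage on two scratch files.
tmp=$(mktemp -d)
printf 'apple\ncookie' > "$tmp/incomplete.txt"
printf 'apple\ncookie\n' > "$tmp/complete.txt"

for file in "$tmp/incomplete.txt" "$tmp/complete.txt"; do
    if ends_with_newline "$file"; then
        echo "$file: last line is complete"
    else
        echo "$file: last line is incomplete"
    fi
done
```

An empty file passes the check, which matches the definition's "zero or more lines".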

Think about what you are asking sort to do.

You are asking it to "take all the lines, and sort them in order."

You've given it a file containing four lines, which it splits into strings like the following (the first line, "apple\n", is left out of these listings for brevity):

"salmon\n"
"cookie\n"
"orange"

It sorts these for you dutifully:

"cookie\n"
"orange"
"salmon\n"

And it then outputs them as a single string:

"cookie
orangesalmon
"

That is almost certainly exactly what you do not want.
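The glued-together result is easy to reproduce by hand and to compare with what sort really prints. A rough sketch using the full four-line file from the question; the variable names naive and actual are illustrative only:

```shell
# What a naive sort would emit: the sorted fragments written back
# verbatim, with "orange" still missing its newline.
naive=$(printf 'apple\n'; printf 'cookie\n'; printf 'orange'; printf 'salmon\n')

# What sort emits for the same unterminated input.
actual=$(printf 'apple\ncookie\nsalmon\norange' | sort)

echo "naive:"
echo "$naive"
echo "actual:"
echo "$actual"
```

The naive version contains the fused word "orangesalmon", while the output of sort keeps "orange" and "salmon" on separate lines.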

So instead, if your file is missing the terminating newline that it should have had, the sort program understands that, most likely, you still intended that last line to be a line, rather than just a fragment of a line. It appends a \n to the string "orange", making it "orange\n". Then it can be sorted properly, without "orange" getting concatenated with whatever line happens to come immediately after it:

"cookie\n"
"orange\n"
"salmon\n"

So when it then outputs them as a single string, it looks a lot better:

"cookie
orange
salmon
"

You could strip the last character off the file, the one from the end of "salmon\n", using a range of handy tools such as awk, sed, perl, php, or even raw bash. This is covered elsewhere, in places like:

How can I remove the last character of a file in unix?
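Purely for illustration, one way to do this in plain shell relies on command substitution dropping trailing newlines. This is a sketch under assumptions (a scratch file, illustrative variable names), not the method from the linked question:

```shell
# Scratch copy of the file from the question (illustrative setup).
dir=$(mktemp -d)
f="$dir/words.txt"
printf 'apple\ncookie\nsalmon\norange' > "$f"

# Capture the sorted text; $(...) drops every trailing newline.
sorted=$(sort "$f")

# Write it back with %s so printf adds nothing after the last word.
printf '%s' "$sorted" > "$f"

last=$(tail -c 1 "$f")
echo "last byte is now: $last"
```

Note that this removes all trailing newlines, not just one, and holds the whole file in a shell variable, so it only suits small files.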

But please don't do that. You'll just cause problems for all other utilities that have to handle your files, like sort. And if you assume that there is no terminating newline in your files, then you will make your code brittle: any part of the toolchain which "fixes" your error (as sort kinda does here) will "break" your code.

Instead, treat text files the way they are meant to be treated in unix: a sequence of "lines" (strings of zero or more non-newline bytes), each followed by a newline.

So newlines are line-terminators, not line-separators.

There is a coding style where prints and echos are done with the newline leading. This is wrong for many reasons, including creating malformed text files and causing the output of the program to be concatenated with the command prompt. printf "orange\n" is the correct style, and also more readable: at a glance, someone maintaining your code can tell you're printing the word "orange" and a newline, whereas printf "\norange" looks at first glance like it's printing a backslash and the phrase "no range" with a missing space.
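Applied to the workflow from the question, the newline-terminated style might look like the sketch below. It assumes the file already ends with a newline; the scratch directory is illustrative.

```shell
# Scratch file that is a well-formed text file: every line is terminated.
dir=$(mktemp -d)
f="$dir/words.txt"
printf 'apple\ncookie\nsalmon\n' > "$f"

# Append the new word as a complete line, newline last.
printf 'orange\n' >> "$f"

# Sort in place; nothing needs repairing, so sort adds nothing.
sort -o "$f" "$f"

cat "$f"
```

Because every step reads and writes complete lines, the file has the same shape before and after the sort, and no tool in the chain has to guess what was intended.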
