
Does csplit on OS X not recognise '$' as end-of-line character?

(I'm using Mac OS X, and this question might be specific to that variant of Unix.)

I'm trying to split a file using csplit with a regular expression. The file consists of various articles merged into one long text file. Each article ends with "All Rights Reserved", and this is at the end of the line: grep Reserved$ finds them all. csplit, however, claims there is no match.

csplit filename /Reserved$/

yields

csplit: Reserved$: no match

which is a clear and obvious lie. If I leave out the $, it works, but I want to be sure that I don't get any stray occurrences of 'Reserved' in the middle of the text. I tried a different word with the beginning-of-line character ^, and that seems to work. Other words that do occur at the end of a line in the data (e.g. and$) also fail to match.

Is this a known bug with OS X?

[Update: I made sure it's not a DOS/Unix line-ending issue by removing all carriage return characters.]

I have downloaded the source code of csplit from http://www.opensource.apple.com/source/text_cmds/text_cmds-84/csplit/csplit.c and tested this in the debugger.

The pattern is compiled with

if (regcomp(&cre, re, REG_BASIC|REG_NOSUB) != 0)
    errx(1, "%s: bad regular expression", re);

and the lines are matched with

/* Read and output lines until we get a match. */
first = 1;
while ((p = csplit_getline()) != NULL) {
    if (fputs(p, ofp) == EOF)
        break;
    if (!first && regexec(&cre, p, 0, NULL, 0) == 0)
        break;
    first = 0;
}

The problem is that the lines returned by csplit_getline() still have a trailing newline character \n. Therefore "Reserved" is not the last thing in the string, and the pattern "Reserved$" does not match.
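The effect can be reproduced outside csplit with a few lines of C. This is only a minimal sketch, not code from csplit.c: the helper name bre_matches is made up for illustration, and REG_BASIC is left out because it is a BSD-specific name (defined as 0 on OS X, as far as I know), so the remaining flag is the same as in the regcomp() call shown earlier.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the POSIX basic regular
 * expression `pattern` matches anywhere in `text`.
 * REG_NOSUB is the flag csplit.c uses; REG_BASIC is omitted
 * for portability (assumption: it is 0 where it exists). */
static bool bre_matches(const char *pattern, const char *text)
{
    regex_t re;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return false;   /* treat a bad pattern as "no match" */

    bool matched = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

With this helper, bre_matches("Reserved$", "All Rights Reserved\n") should come out false, while the same call on "All Rights Reserved" (no trailing newline) should come out true. Without REG_NEWLINE, the $ anchor only matches at the very end of the string.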

After a quick-and-dirty insertion of

    p[strlen(p)-1] = 0;

to remove the trailing newline from the input string, the "Reserved$" pattern worked as expected.
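The one-line fix above is fine for an experiment, but it would index before the start of the buffer on an empty string, and it would chop a real character off a final line that has no newline. A slightly more careful variant might look like the sketch below. The function name match_without_newline is hypothetical and not part of csplit.c; it works on a copy, on the assumption that the caller still needs the original line unchanged.

```c
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from csplit.c: match `line` against an
 * already compiled regex while ignoring a trailing newline.
 * Returns 1 on match, 0 on no match, -1 if memory allocation fails. */
static int match_without_newline(const regex_t *re, const char *line)
{
    /* Length up to the first '\n', or the whole string if there is none. */
    size_t len = strcspn(line, "\n");

    char *copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, line, len);
    copy[len] = '\0';

    int rc = regexec(re, copy, 0, NULL, 0);
    free(copy);
    return rc == 0 ? 1 : 0;
}
```

In the matching loop, a call such as match_without_newline(&cre, p) == 1 would take the place of the regexec() test, leaving p itself untouched.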

There seem to be more problems with csplit in Mac OS X; see the comments on the answer to "Looking for correct Regular Expression for csplit" (the repetition count {*} does not work either).

Remark: You can match "Reserved" at the end of the line with the following trick:

csplit filename /Reserved<Ctrl-V><Ctrl-J>/

where you actually use the Control keys to enter a newline character on the command line.
