
Extract large list of lines from large text file

I need to extract ~5000 lines from a file with ~300,000 lines on bash (OSX). Running

sed '128082p;128083p;...(4996 numbers)....;159845q;d' file > output

gives the error

sed: 1: "128082p;128083p;128084p ...": command expected

This same command works if I try to extract only 10 lines. Whereas running

for i in `cat line_file`; do sed -n "$ip" file; done >> output

creates a file that's more than ~5000 lines long. What's the right command in either case?

Edit: this is not a range of numbers.

Tip of the hat to Jonathan Leffler for his help.

It looks like BSD sed as used on macOS (as of macOS 10.12.1) has a hard limit on the size of each line of a script that can be passed to it: 2048 bytes.

When passed as a command-line argument (implicitly as the first operand, or explicitly via -e options), scripts are typically passed as a single line, as you did.

If that single line gets too long, it is regrettably cut off blindly, typically resulting in a seemingly random syntax error, like the one you saw.

There are two workarounds:

  • Make sure that your script contains only short-enough lines by separating commands with \n (newlines) instead of ; and/or split your script across multiple -e options (which is cumbersome).

  • Provide the entire script via a file, using the -f option, in which case all commands must be separated with \n rather than ; anyway.
    In the unlikely event that your script is too long to fit on a single command line (a limit imposed by the system - see bottom), using -f is your only option.
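
As a small illustration of the first workaround, here is a sketch that selects two lines from a toy input, once with multiple -e options and once with a single newline-separated script argument. The file names sample.txt, picked_e.txt and picked_nl.txt are hypothetical and exist only for this example.

```shell
# Hypothetical sample input: a 5-line file named sample.txt.
printf 'line %s\n' 1 2 3 4 5 > sample.txt

# Each -e option contributes its own script line, so no single line grows long.
sed -n -e '2p' -e '4p' sample.txt > picked_e.txt

# Same selection as one script argument whose commands are newline-separated.
sed -n "$(printf '%sp\n' 2 4)" sample.txt > picked_nl.txt

cat picked_e.txt
```

Both forms should select the same lines; with thousands of line numbers the newline-separated form is far easier to generate than thousands of -e options.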


Here's an example of a command-line script that is too long:

$ sed -n "$(printf '%sp;' {1..432})" <<<'line 1'
sed: 1: "1p;2p;3p;4p;5p;6p;7p;8p ...": command expected # !! ERROR

Even though the script is syntactically correct, cutting its one and only line off at 2048 bytes leaves it incorrect, resulting in the seemingly random command expected error.
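
To see why 432 is the number used in that example, the byte length of the generated one-line script can be measured directly. This is only a sketch: script_len is a hypothetical helper name, and it assumes seq is available (it is on macOS and on typical Linux systems).

```shell
# Byte length of the single-line script "1p;2p;...;Np;" for a given N.
# $(seq ...) is intentionally unquoted so each number becomes its own argument.
script_len() {
  printf '%sp;' $(seq 1 "$1") | wc -c | tr -d ' '
}

len_432=$(script_len 432)
len_431=$(script_len 431)
echo "N=432: $len_432 bytes"   # 2052, more than 2048
echo "N=431: $len_431 bytes"   # 2047, less than 2048
```

The script for 1 through 432 is the first one to exceed 2048 bytes, which is consistent with the limit described above.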

In this case, working around the limitation is simple: by replacing ; with \n, the individual lines become short enough:

$ sed -n "$(printf '%sp\n' {1..432})" <<<'line 1'
line 1 # OK

Since you already have a file of line numbers, line_file, you can use an auxiliary sed command to create your \n-separated script from it:

 $ sed -n "$(sed 's/$/p/' line_file)" file > output

Here's how to solve the problem with a script file passed via -f, in which the commands are \n-separated:

$ printf '%sp\n' {1..432} > script.sed # Create script file with \n-separated commands.
$ sed -n -f "script.sed" <<<'line 1' # Pass script file via -f
line 1 # OK
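
Applied to a file of line numbers like the one in the question, the -f approach might look like the following sketch. The names demo_file, demo_lines, demo_script.sed and demo_output are hypothetical stand-ins for file, line_file, the script file and output, chosen so that running the example does not overwrite real data.

```shell
# Hypothetical stand-ins for the real inputs: a 10-line file and three line numbers.
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > demo_file
printf '%s\n' 3 5 8 > demo_lines

# Turn each line number N into the sed command "Np", one command per line.
sed 's/$/p/' demo_lines > demo_script.sed

# Run the generated script; -n suppresses default output so only selected lines print.
sed -n -f demo_script.sed demo_file > demo_output

cat demo_output
```

Because every command sits on its own short line in the script file, neither the per-line limit nor the command-line length limit comes into play.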

Note: Using a process substitution ( sed -n -f <(printf ...) ... ) as an ad-hoc script file inexplicably does not work.

Also note that the overall max. length of a command line for invoking an external utility such as sed on macOS (as of 10.12) is 262144 bytes (256 KB; determined with getconf ARG_MAX), and in practice the limit is lower, because the size of the environment-variable block plays a role.
If you were to hit that limit, however, you'd get a more helpful error message: Argument list too long.
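
A rough way to check how much room is left on a given machine is to compare ARG_MAX with the size of the current environment. This is an approximation only: the variable names are hypothetical, and the kernel's exact accounting (pointers, terminators, padding) is not modeled here.

```shell
# Upper bound, in bytes, on the combined size of arguments and environment.
arg_max=$(getconf ARG_MAX)

# Approximate size of the current environment block, which counts against that bound.
env_bytes=$(env | wc -c | tr -d ' ')

echo "ARG_MAX: $arg_max bytes"
echo "environment: $env_bytes bytes"
echo "approximate room left for arguments: $((arg_max - env_bytes)) bytes"
```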
