
bash /bin/grep: Argument list too long (using --file option)

I have a text file containing 33,869 rows and I have to filter 30,067 of them.

For example:

File: input.txt (CSV-like, 33,869 rows)

#00001:A123456.10.101.102,first,row,value2,1
#00002:A123456.10.101.103,second,row,value7,85
(omissis)
#33869:A123456.25.170.180,last,test,value9,0

File: filter.txt (list of values, one per line, 30,067 rows)

A123456.10.101.102
A123456.10.101.103
(omissis)
A123456.24.150.115

(expected) Output file: output.txt (CSV-like, the 30,067 matching rows from input.txt):

#00001:A123456.10.101.102,first,row,value2,1
#00002:A123456.10.101.103,second,row,value7,85
(omissis)
#30067:A123456.24.150.115,whatever,x,y,99

The command I'm using is:

#!/bin/bash
/bin/grep --file="filter.txt" input.txt > output.txt

but the error returned is

/bin/grep: Argument list too long

Am I forced to split "filter.txt" into smaller chunks?

What is the allowed limit?

I did not find the limit documented in the man page.

If there are no regular expressions in the filter file, you should switch to grep -F, which can handle a significantly larger number of patterns.
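As a minimal sketch of that suggestion applied to the files above, -F makes grep treat every line of filter.txt as a fixed string rather than a regular expression, so the dots in the keys match literally and no regexes need to be compiled:

```shell
# Fixed-string matching: each line of filter.txt is a literal pattern.
grep -F --file=filter.txt input.txt > output.txt
```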

Failing that, splitting the pattern file would be hugely more efficient than running 30,000+ iterations of grep over the same input file.

Here's a split into chunks of 10,000 lines; adapting it to a different chunk size should be trivial.

#!/bin/sh

t=$(mktemp -d -t fgrepsplit.XXXXXXXXXXXX) || exit
trap 'rm -rf "$t"' EXIT       # Remove temp dir when done
trap 'exit 127' HUP INT TERM  # Remove temp dir if interrupted, too

split -l 10000 "$1" "$t"/pat

for p in "$t"/pat*; do
    grep -F -f "$p" "$2"
done

From what you write, I wonder whether grep is the right tool for the job. With grep you would usually apply a small set of matching rules, expressed as regular expressions. In your case, you are matching against a long list of literal strings.

This seems to be a case of finding the lines that full_file.txt and filtered.txt have in common. You might want to look at the following tools to achieve this:

  • join ( http://linux.die.net/man/1/join ) gives you the lines that two files have in common. Note that both files have to be sorted; you can use process substitution to achieve this.
  • combine ( http://linux.die.net/man/1/combine ) is a more general utility that does not require the input to be sorted, but it may not be available everywhere.
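For illustration, here is a hedged sketch of the join route, assuming bash (for process substitution) and assuming the key sits between the ':' and the first ',' of each input.txt line:

```shell
#!/bin/bash
# Sketch only: prefix each input.txt line with its key (the field between
# ':' and the first ','), sort both sides, join on the key, then strip the
# key again. Assumes the lines contain no whitespace.
join <(sort filter.txt) \
     <(awk -F'[:,]' '{print $2, $0}' input.txt | sort -k1,1) \
  | cut -d' ' -f2- > output.txt
```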

What about iterating over each line of your filter file? Something like:

while IFS= read -r i; do
   grep -F -- "$i" full_file.txt   # -F: the keys contain dots, so match them literally
done < grep_filter.txt > filtered.txt

An awk alternative:

awk -F"[:,]" 'FNR==NR{a[$2]=$0;next} ($0 in a) {print a[$0]}'  input.txt filter.txt
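To make the mechanics concrete, here is the same one-liner run on a tiny made-up sample (the file contents and keys are illustrative): during the first pass (FNR==NR, while awk is still reading input.txt) each line is stored in array a keyed by its second ':'/','-separated field; during the second pass, each key from filter.txt that exists in a prints the stored line, so the output follows filter.txt's order.

```shell
#!/bin/sh
# Tiny demo with made-up data; writes sample files to the current directory.
printf '#00001:A.1,first,1\n#00002:B.2,second,2\n' > input.txt
printf 'A.1\n' > filter.txt
awk -F'[:,]' 'FNR==NR{a[$2]=$0;next} ($0 in a) {print a[$0]}' input.txt filter.txt
# prints: #00001:A.1,first,1
```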
