How to collect data based on the starting characters of a line?
I'm trying to find a more time-efficient way to grep/search for lines that begin with a specific character or set of characters. I have a 50GB file of data sorted with the command
LC_ALL='C' sort -u data.txt > data_sorted.txt
Then, let's say I want to find all lines that begin with horse. I would currently run
LC_ALL='C' grep -i -E "^horse.*" data_sorted.txt
The issue I'm facing with this command is that grep doesn't automatically see (and jump to) the lines that begin with horse; instead it seems to grep straight through 0-9, A-Z, or whatever it does. Is there an alternative method of collating the data so that the search jumps straight to the first character of the search query, to speed things up?
This is kind of hard to explain; apologies for any confusion.
One possible approach is to use look(1). While this is normally used to search the system word list dictionary, you can specify a different file, and it does a binary search for lines matching a given prefix.
So you might try:
look horse data_sorted.txt
(Some versions of look might require the -b option to do a binary search; consult your local man page.)
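To make the binary search idea concrete, here is a minimal Python sketch of looking up a prefix in a sorted file by byte offset. It is not how look is implemented; the function name prefix_lines and the sample data are hypothetical, and the sketch assumes the file is sorted in plain byte order (as LC_ALL='C' sort produces) and that the search is case-sensitive.

```python
import os
import tempfile


def prefix_lines(path, prefix):
    """Return the lines of a byte-sorted file that start with `prefix` (bytes)."""
    n = len(prefix)
    with open(path, "rb") as f:

        def seek_line_start(pos):
            # Position the file at the first line that starts at offset >= pos.
            if pos == 0:
                f.seek(0)
            else:
                f.seek(pos - 1)
                f.readline()  # consume up to and including the next newline

        # Find the smallest offset whose following line is >= prefix.
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            seek_line_start(mid)
            line = f.readline()
            if line and line.rstrip(b"\n")[:n] < prefix:
                lo = mid + 1  # this line sorts before the prefix: look later
            else:
                hi = mid      # at or past the first match, or at end of file

        # Matches form one contiguous run starting here; stop at the first miss.
        seek_line_start(lo)
        matches = []
        for line in f:
            if not line.startswith(prefix):
                break
            matches.append(line.rstrip(b"\n"))
        return matches


# Hypothetical sample data standing in for data_sorted.txt.
sample = sorted([b"zebra", b"horse", b"apple", b"horsefly", b"house", b"horseshoe"])
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"\n".join(sample) + b"\n")
print(prefix_lines(tmp.name, b"horse"))  # [b'horse', b'horsefly', b'horseshoe']
os.remove(tmp.name)
```

The loop only reads a line or two at each probe, so the number of reads grows with the logarithm of the file size instead of with the file size itself, which is the advantage over scanning the whole file with grep.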
If you want to do a case-insensitive search like in your grep case, the file has to be sorted in a case-insensitive way (sort -f) and look needs the -f option.
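The reason the sort order has to match is that a binary search relies on all matching lines sitting in one contiguous run. The small Python sketch below illustrates this with made-up words; the helper names are hypothetical, and the case-folded key is only an approximation of what sort -f does.

```python
# Hypothetical sample lines with mixed case.
words = [b"Horse", b"apple", b"horse", b"Zebra", b"horsefly", b"HORSESHOE"]

# Plain byte order, as produced by LC_ALL='C' sort.
byte_order = sorted(words)

# Case-folded order (an approximation of sort -f): compare the upper-cased
# text first, then fall back to the raw bytes to break ties.
folded_order = sorted(words, key=lambda w: (w.upper(), w))


def match_positions(lines, prefix):
    """Indexes of lines that start with `prefix`, ignoring case."""
    return [i for i, w in enumerate(lines) if w.lower().startswith(prefix.lower())]


def is_contiguous(positions):
    """True if the indexes form one unbroken run (or there are none)."""
    return not positions or positions == list(range(positions[0], positions[-1] + 1))


print(match_positions(byte_order, b"horse"))    # [0, 1, 4, 5]
print(match_positions(folded_order, b"horse"))  # [1, 2, 3, 4]
print(is_contiguous(match_positions(byte_order, b"horse")))    # False
print(is_contiguous(match_positions(folded_order, b"horse")))  # True
```

In the byte-ordered list the upper-case and lower-case matches are separated by unrelated lines, so a case-insensitive binary search could land between them and miss some. In the case-folded list they sit together.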