
How to collect data based on starting character on line?

I'm trying to find a more time-efficient way to grep/search for lines that begin with a specific character or set of characters. I have a 50 GB file of data that was sorted with the command

LC_ALL='C' sort -u data.txt > data_sorted.txt

Now, let's say I want to find all lines that begin with horse. I would currently run

LC_ALL='C' grep -i -E "^horse.*" data_sorted.txt

The issue I'm facing with this command is that grep doesn't automatically jump to the lines that begin with horse. Instead, as far as I can tell, it works through the file from the start (0-9, A-Z, and so on). Is there an alternative way of searching the sorted data that jumps straight to the first character of the search query to speed things up?

This is kind of hard to explain, apologies for any confusion.

One possible approach is to use look(1). While it is normally used to search the system word list, you can specify a different file, and it does a binary search for lines matching a given prefix.

So you might try:

look horse data_sorted.txt

(Some versions of look might require the -b option to do a binary search; consult your local man page.)

If you want to do a case-insensitive search like in your grep case, the file has to be sorted in a case-insensitive way ( sort -f ) and look needs the -f option.
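If look is not available, the same idea can be sketched in a few lines of Python. This is only an illustration of a binary search over a byte-sorted file, not how look itself is implemented. The function names lines_with_prefix and first_line_at_or_after are made up for this example, and it assumes the file was sorted with LC_ALL=C so that plain byte comparison matches the sort order (so it is case-sensitive).

```python
import os


def first_line_at_or_after(f, pos):
    """Return the offset of the first line that starts at or after byte `pos`."""
    if pos == 0:
        return 0
    f.seek(pos - 1)
    f.readline()  # consume up to and including the next newline
    return f.tell()


def lines_with_prefix(path, prefix):
    """Yield lines (as bytes) that start with `prefix`.

    Assumes `path` is sorted in byte order (LC_ALL=C sort); `prefix` is bytes.
    """
    with open(path, "rb") as f:
        lo, hi = 0, os.fstat(f.fileno()).st_size
        # Binary search over byte offsets for the smallest offset whose
        # following line sorts at or after the prefix.
        while lo < hi:
            mid = (lo + hi) // 2
            start = first_line_at_or_after(f, mid)
            f.seek(start)
            line = f.readline()
            if not line or line.rstrip(b"\n") >= prefix:
                hi = mid       # at or past the first candidate line
            else:
                lo = mid + 1   # still before the first candidate line
        # Scan forward from the first candidate until the prefix stops matching.
        f.seek(first_line_at_or_after(f, lo))
        for line in f:
            if not line.startswith(prefix):
                break
            yield line
```

With the file from the question, list(lines_with_prefix("data_sorted.txt", b"horse")) would collect the matching lines. Only a few dozen seeks are needed to locate the first match even in a 50 GB file, after which the matches are read sequentially.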
