How to collect data based on starting character on line?

Question

I'm trying to find a more time-efficient way to "grep/search" for lines which begin with a specific character or set of characters. I have a 50GB file containing data sorted via the command

LC_ALL='C' sort -u data.txt > data_sorted.txt

Then, let's say I want to find all lines which begin with horse. I would currently do

LC_ALL='C' grep -i -E "^horse.*" data_sorted.txt

The issue I'm facing with this command is that grep doesn't automatically see (and jump to) the lines which begin with horse; instead it scans the file from the top (0-9, A-Z, or whatever order it follows). Is there an alternative method of looking up the data that jumps straight to the first character of the search query, to speed things up?

This is kind of hard to explain, apologies for any confusion.

Answer 1

One possible approach is to use look(1). While this is normally used to search the system word list dictionary, you can specify a different file, and it does a binary search for lines matching a given prefix.

So you might try:

look horse data_sorted.txt

(Some versions of look might require the -b option to do a binary search; consult your local man page.)
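
To try the approach above on a small scale before running it against the 50GB file, here is a minimal sketch. The helper name prefix_search, the sample words and the temporary directory are made up for illustration, and the awk fallback (a plain linear scan used only when look is not installed) is my own addition, not part of look itself.

```shell
# Hypothetical helper: print the lines of a byte-sorted file that start with a prefix.
# Uses look(1) when available; otherwise falls back to a linear awk scan (assumption).
prefix_search() {
    prefix=$1
    file=$2
    if command -v look >/dev/null 2>&1; then
        # Binary search; the file must have been sorted with LC_ALL=C sort.
        LC_ALL=C look "$prefix" "$file"
    else
        # index() returns 1 when the line begins with the prefix.
        LC_ALL=C awk -v p="$prefix" 'index($0, p) == 1' "$file"
    fi
}

# Build a tiny sample file and sort it the same way as in the question.
workdir=$(mktemp -d)
printf '%s\n' zebra horsefly apple horse Horse hors horse > "$workdir/data.txt"
LC_ALL=C sort -u "$workdir/data.txt" > "$workdir/data_sorted.txt"

matches=$(prefix_search horse "$workdir/data_sorted.txt")
printf '%s\n' "$matches"

rm -rf "$workdir"
```

On this sample the search is case-sensitive, so it should print horse and horsefly but not Horse or hors.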

If you want to do a case-insensitive search like in your grep case, the file has to be sorted in a case-insensitive way (sort -f) and look needs the -f option.
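
The case-insensitive variant can be sketched the same way. Again, the helper name prefix_search_ci and the sample data are hypothetical, and the awk fallback is only an assumption for systems without look. The sample sticks to plain letters; lines containing punctuation may need extra care, since sort -f and look -f are separate tools and I have not checked that they fold every character identically.

```shell
# Hypothetical helper: case-insensitive prefix search on a file sorted with sort -f.
prefix_search_ci() {
    prefix=$1
    file=$2
    if command -v look >/dev/null 2>&1; then
        # -f ignores the case of alphabetic characters.
        LC_ALL=C look -f "$prefix" "$file"
    else
        # Linear fallback: compare lowercased line against lowercased prefix.
        LC_ALL=C awk -v p="$prefix" 'index(tolower($0), tolower(p)) == 1' "$file"
    fi
}

workdir=$(mktemp -d)
printf '%s\n' zebra HORSEFLY apple horse Horse hors > "$workdir/data.txt"

# Re-sort with case folding; the original LC_ALL=C sort -u output is case-sensitive.
LC_ALL=C sort -f "$workdir/data.txt" > "$workdir/data_sorted_ci.txt"

matches_ci=$(prefix_search_ci horse "$workdir/data_sorted_ci.txt")
printf '%s\n' "$matches_ci"

rm -rf "$workdir"
```

The -u flag is left out of the sort here on purpose: combined with -f, sort treats lines that differ only in case as duplicates and keeps just one of them, which may or may not be what you want.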
