
Read file line by line and print the first match in each line or “no_data” when nothing matches

I would like to read a text file line by line to search for a pattern; when the first match in a line is found, print it to a file and move to search for the pattern in the next line.

With my limited shell skills I have tried the following; unfortunately, when a line contains no match, it never prints no_data to the file d.txt.

while read u ; do
    echo "$u" | grep -o '[0-9]\{2\}/[0-9]\{2\}/[0-9]\{4\}  [0-9]\{2\}:[0-9]\{2\}' |head -1 || echo "no_data" 
done < tmc.txt > d.txt

Note: the pattern I am trying to match is a date and time stamp in the format mm/dd/yyyy hh:mm .

For instance, $u can be a string like this or even bigger with all sorts of garbage:

disk0/bcdackup_20160908_115716/d/.ER/ERORR_log_msnf_20160906_113039:10641:  Test Status:         Failed ;Test PL (some test) was started in execution mode.  09/06/2016  14:43:28.4954  Machine:msnf  (Rl888751, , ?.?, 1637) USER EVENT: TM-1102 DEFAULT  -- SYSTEM ERROR: TX-0003 INIT  Function Protocol Violation. Verification by TXXAxREQxConfig_destroy_config failed: 'engine_ptr != NULL' not TRUE  -- SYSTEM EVENT: ER-0FFF DEFAULT (linked to IH-154B) DEACTIVATE: IH-154b DEACTIVATE: IH-154b  -- SYSTEM EVENT: ER-0FFF DEFAULT (linked to IH-154C) DEACTIVATE: IH-154c DEACTIVATE: IH-154c  -- SYSTEM ERROR: WP-2631 CHANGEPARAMS  Error during processing of Finite State Machine Error starting perform_smooth_landing : event perform_smooth_landing not allowed in state {original_mc, actuator_system_enabled, service_off, not_homed} of state-machine WPLS.V1.2  -- SYSTEM ERROR: WP-2630 CHANGEPARAMS  Error during processing of F   

Any shell utility such as grep, awk, sed, perl is fine by me.

Here's a Perl solution:

perl -nle 'print m{(\d{2}/\d{2}/\d{4} +\d{2}:\d{2})} ? $1 : "no_data"' < tmc.txt > d.txt

-n loops over lines in the input.

-l automatically chomps off newlines from the input and adds them to the output.

For each line we do a straightforward regex match with a capture group. If successful, we print the matched string, otherwise no_data .

To do this with a single grep call over the whole file, you'd have to use some kind of variable-length negative look-behind to make sure that you're looking at the first date in the line. Apparently, Perl-compatible regular expressions can do that with "backtracking control verbs", but a) I'm not sure whether grep -P supports those, and b) you also want to replace non-matching lines, which grep can't do anyway.
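As for why the loop in the question never prints no_data: the exit status of a pipeline is that of its last command, and head succeeds even on empty input, so the || branch never runs. A minimal sketch of a fix, assuming bash and a grep that supports -o, captures the match first and falls back with ${match:-no_data}. The function name first_stamp is made up for illustration.

```shell
# first_stamp is a hypothetical helper name, not from the original post.
# Reads lines on stdin; prints the first date/time stamp per line, or no_data.
first_stamp() {
    local line match
    while IFS= read -r line || [ -n "$line" ]; do
        # Capture the output instead of testing the pipeline's exit status,
        # which would be the status of head (0 here), not of grep.
        match=$(printf '%s\n' "$line" |
                grep -o '[0-9]\{2\}/[0-9]\{2\}/[0-9]\{4\}  *[0-9]\{2\}:[0-9]\{2\}' |
                head -n 1)
        # An empty capture means grep found nothing on this line.
        printf '%s\n' "${match:-no_data}"
    done
}

# Usage with the file names from the question:
# first_stamp < tmc.txt > d.txt
```

This still starts grep and head once per line, so it is only meant to show the fix, not to be fast.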

As an alternative to calling grep on every line, you could use sed:

sed -r '
    # On non-matching lines, replace the line with "no_data" and skip to the next line
    /([0-9]{2}\/){2}[0-9]{4} +[0-9]{2}:[0-9]{2}/! {
        s/.*/no_data/
        b
    }
    # Remove everything after the first date and its time
    s/(([0-9]{2}\/){2}[0-9]{4} +[^ ]*).*/\1/
    # Remove everything before the first date
    s/.*(([0-9]{2}\/){2}[0-9]{4})/\1/
' infile

For a version of infile using your sample line three times (first with both dates intact, then with the first date removed, then with both dates removed) the output is

$ sed -r '/([0-9]{2}\/){2}[0-9]{4} +[0-9]{2}:[0-9]{2}/!{s/.*/no_data/;b};s/(([0-9]{2}\/){2}[0-9]{4} +[^ ]*).*/\1/;s/.*(([0-9]{2}\/){2}[0-9]{4})/\1/' infile
09/06/2016  14:43:28.4954
08/06/2016  18:53:28.4757
no_data

as expected.

The sed command first checks whether the line contains a date; if not, the whole line is replaced by no_data and the remaining commands are skipped with b. They wouldn't change anything on such a line anyway, but skipping them should make execution faster.

If the line does contain a date, two substitutions are performed: the first one removes everything after the first date, the second one everything before it. This has to happen in two steps, or the greedy matching would result in the last date on the line being printed.
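Since awk was also mentioned in the question, here is a sketch of the same idea, assuming a POSIX awk. match() reports the leftmost match through RSTART and RLENGTH, so no two-step trimming is needed. The digits are spelled out as [0-9][0-9] because some awk implementations do not support {2} intervals. The wrapper name first_stamp_awk is invented for the example.

```shell
# first_stamp_awk is a hypothetical wrapper name; it accepts file arguments or stdin.
first_stamp_awk() {
    awk '
    {
        # match() sets RSTART and RLENGTH to the leftmost match, i.e. the first stamp
        if (match($0, /[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] +[0-9][0-9]:[0-9][0-9]/))
            print substr($0, RSTART, RLENGTH)
        else
            print "no_data"
    }' "$@"
}

# Usage with the file names from the question:
# first_stamp_awk tmc.txt > d.txt
```

Unlike the sed version, this prints the stamp only up to hh:mm, which is the format asked for in the question.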


Quick performance comparison for a 40 MB input file:

  • Bash loop calling grep on each line: ~24 seconds
  • Sed: ~4 seconds
  • Perl: < 0.1 seconds
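Much of the loop's time presumably goes into starting grep and head once per line. A sketch that keeps the loop but uses bash's built-in regex matching instead, assuming bash 3.2 or later, is shown below. It was not part of the timings above, and the function name first_stamp_builtin is again made up.

```shell
#!/usr/bin/env bash
# first_stamp_builtin is a hypothetical name; requires bash for [[ =~ ]] and BASH_REMATCH.
first_stamp_builtin() {
    local line
    # Keep the regex in a variable so bash treats it as a regex, not a quoted string
    local re='[0-9]{2}/[0-9]{2}/[0-9]{4} +[0-9]{2}:[0-9]{2}'
    while IFS= read -r line || [ -n "$line" ]; do
        if [[ $line =~ $re ]]; then
            # BASH_REMATCH[0] holds the leftmost match in the line
            printf '%s\n' "${BASH_REMATCH[0]}"
        else
            printf '%s\n' no_data
        fi
    done
}

# Usage with the file names from the question:
# first_stamp_builtin < tmc.txt > d.txt
```

A shell loop like this is still unlikely to come close to Perl on large inputs, but it avoids spawning two processes per line.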


 