
Remove all lines except the last which start with the same string

I'm using awk to process a file to filter lines to specific ones of interest. With the output which is generated, I'd like to be able to remove all lines except the last which start with the same string.

Here's an example of what is generated:

this is a line
duplicate remove me
duplicate this should go too
another unrelated line
duplicate but keep me
example remove this line
example but keep this one
more unrelated text

Lines 2 and 3 should be removed because they start with duplicate, as does line 5. Line 5 should therefore be kept, as it is the last line starting with duplicate.

The same applies to line 6, since it begins with example, as does line 7. Line 7 should therefore be kept, as it is the last line starting with example.

Given the example above, I'd like to produce the following output:

this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text

How could I achieve this?

I tried the following, however it doesn't work correctly:

awk -f initialProcessing.awk largeFile | awk '{currentMatch=$1; line=$0; getline; nextMatch=$1; if (currentMatch != nextMatch) {print line}}' - 

Why don't you read the file from the end to the beginning and print only the first line containing duplicate? That way you don't have to keep track of what was already printed, hold lines back, and so on.

tac file | awk '/duplicate/ {if (f) next; f=1}1' | tac

This sets a flag f the first time duplicate is seen. From the second time on, the flag causes the line to be skipped.
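As a rough illustration of the flag approach, the sketch below runs it against the sample input from the question. It assumes `tac` from GNU coreutils is available (BSD and macOS systems offer `tail -r` instead). The function name `keep_last_duplicate` and the temporary file are made up for this example. Note one deliberate difference: `/duplicate/` matches the word anywhere in a line, whereas this sketch compares the first field, which is closer to "starts with".

```shell
# Sample input from the question, written to a temporary file (hypothetical path).
sample=$(mktemp)
printf '%s\n' \
  'this is a line' \
  'duplicate remove me' \
  'duplicate this should go too' \
  'another unrelated line' \
  'duplicate but keep me' \
  'example remove this line' \
  'example but keep this one' \
  'more unrelated text' > "$sample"

# Hypothetical helper: reverse the file, keep only the first line whose
# first field is "duplicate" (the last one in the original order), then
# reverse back. All other lines pass through unchanged.
keep_last_duplicate() {
  tac "$1" | awk '$1 == "duplicate" { if (f) next; f = 1 } 1' | tac
}

keep_last_duplicate "$sample"
```

Only the duplicate lines are filtered here, so both example lines survive; that is what the generic version is for.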

To make this generic, so that for every first word only the last line starting with it is printed, use an array:

tac file | awk '!seen[$1]++' | tac

This keeps track of the first words that have appeared so far. They are stored in the array seen[], so !seen[$1]++ is true only when $1 occurs for the first time; from the second time on it evaluates to false and the line is not printed.
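If the compact `!seen[$1]++` idiom is hard to read, it may help to see it written out long-hand. The sketch below should behave the same way; the function name `first_per_key` and the three-line input are invented for the illustration.

```shell
# Hypothetical helper: long-hand equivalent of awk '!seen[$1]++'.
first_per_key() {
  awk '{
    # An unset array entry compares equal to 0, so this test is true
    # only the first time a given first word appears.
    if (seen[$1] == 0) print
    # Count the occurrence either way.
    seen[$1]++
  }' "$@"
}

# The third line repeats the first word "a", so it is dropped.
printf '%s\n' 'a 1' 'b 2' 'a 3' | first_per_key
```

On its own this keeps the first line per first word. Feeding it reversed input with tac and reversing the result again is what turns "first" into "last".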

Test

$ tac a | awk '!seen[$1]++' | tac
this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text

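You can use an (associative) array to always keep the last occurrence. Note that for (i in last) visits the keys in an unspecified order, so the lines may not come out in their original order: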

awk '{last[$1]=$0;} END{for (i in last) print last[i];}' file
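If the original line order matters, one possible single-pass variant is to remember the line number of each first word's last occurrence and replay the input in the END block. This is only a sketch: the function name `keep_last_in_order` is hypothetical, and it holds the whole input in memory, which may be a concern for a very large file.

```shell
# Hypothetical helper: keep only the last line per first word,
# preserving the order in which those lines appear in the input.
keep_last_in_order() {
  awk '{
    key[NR]  = $1   # first word of this line
    line[NR] = $0   # the full line
    last[$1] = NR   # overwritten each time, so it ends up as the last occurrence
  }
  END {
    # Replay the input in order and print a line only if it is the
    # last occurrence of its first word.
    for (i = 1; i <= NR; i++)
      if (last[key[i]] == i) print line[i]
  }' "$@"
}
```

It reads from a file argument or from standard input, for example `keep_last_in_order file`. Blank lines all share the empty first word, so only the last blank line would be kept.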
