How can I remove lines from a file when a string appears on multiple lines?

I have a file with two columns, like the following:

apple pear
banana pizza
spoon fork
pizza plate
sausage egg

If a word appears on multiple lines, I want to delete every line on which the repeated word appears. As you can see, 'pizza' appears twice, so two lines should be deleted. The following is the required output:

apple pear
spoon fork
sausage egg

I am aware of using:

awk '!seen[$1]++' 

However, this only removes duplicates when the string appears in the first column; I require a command that will check both columns. How can I achieve this?
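
To see the problem concretely (assuming the sample is saved as file, the name the answers below also use): that command prints all five lines unchanged, because no word repeats within the first column:

$ awk '!seen[$1]++' file
apple pear
banana pizza
spoon fork
pizza plate
sausage egg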

You could solve the problem in multiple steps by using grep and uniq -d.

First, generate a list of all words using something like grep -Eo '[^ ]+'. Then filter that list so that only duplicated words remain; that can be done with … | sort | uniq -d. Finally, print all lines that do not contain any word from the previously generated list using grep -Fwvf listFile inputFile.

In bash, all these steps can run in a single command. Here we use the variable $in to make it easily adaptable.

in="path/to/your/input/file"
grep -Fwvf <(grep -Eo '[^ ]+' "$in" | sort | uniq -d) "$in"
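
For the sample file, the process substitution yields just the one duplicated word; you can check that stage on its own:

$ grep -Eo '[^ ]+' "$in" | sort | uniq -d
pizza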

Using awk, you can keep track of many things: not only whether you have seen a word, but also which line it was seen on. We keep a couple of arrays:

  • record: keeps every line we have parsed, indexed by record number
  • seen: keeps each word together with the first record number it was seen on

This gives us:

awk '{ record[NR]=$0 }
     { for(i=1;i<=NF;++i) {
         if ($i in seen) { delete record[NR]; delete record[seen[$i]] }
         else { seen[$i]=NR }
       }
     }
     END { for(i=1;i<=NR;++i) if (i in record) print record[i] }' file 

How does this work?

  • record[NR]=$0: store the record $0 in the array record, indexed by the record number NR
  • for each field/word of the record, check whether the word has been seen before. If it has, delete both the record it was first seen on (at seen[$i]) and the current record from record. If it has not, store the word and the current record number in the array seen.
  • when the full file has been processed, walk over all possible record numbers; if a number is still an index of the array record, print that record.
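
Run against the sample file, this prints:

apple pear
spoon fork
sausage egg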

A two-pass alternative reads the file twice (note that file appears twice at the end of the command): the first pass records, for each word, the line number of its first appearance and marks every line number that contains a repeated word; the second pass prints only the unmarked lines.

$ awk '
    NR==FNR {
        for (i=1; i<=NF;i++) {
            if ( firstNr[$i] ) {
                multi[NR]
                multi[firstNr[$i]]
            }
            else {
                firstNr[$i] = NR
            }
        }
        next
    }
    !(FNR in multi)
' file file
apple pear
spoon fork
sausage egg

or, if you prefer, simply count every word in the first pass and skip any line containing a word seen more than once:

$ awk '
    NR==FNR {
        for (i=1; i<=NF;i++) {
            cnt[$i]++
        }
        next
    }
    {
        for (i=1; i<=NF;i++) {
            if ( cnt[$i] > 1 ) {
                next
            }
        }
        print
    }
' file file
apple pear
spoon fork
sausage egg

This works with your sample:

#!/usr/bin/env sh
filename='x.txt'
for dupe in $(xargs -n1 -a "${filename}" | sort | uniq -d); do
  sed -i.bak -e "/\\<${dupe}\\>/d" "${filename}"
done

It builds a list of the words that appear more than once in the file:

  • xargs -n1 -a "${filename}" outputs the list of all words contained in the file (one word per line)
  • | sort sorts the list
  • | uniq -d outputs only the words that appear more than once (after sorting, duplicates sit on consecutive lines)

Then it uses sed to select and delete all lines containing each duplicated word.
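
With the sample file, pizza is the only duplicated word, so the loop invokes sed exactly once, expanding to:

sed -i.bak -e "/\<pizza\>/d" x.txt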

This might work for you (GNU grep, sort, uniq, sed):

sed 's/ /\n/g' file | sort | uniq -d | grep -wvFf - file

Or a toy GNU sed solution:

cat <<\! | sed -Ef - file
H         # append each line to the hold space, building up a copy of the file
$!d       # delete every line except the last
g         # at EOF replace pattern space with entire file
y/ /\n/;  # put each word on a separate line
# make a list of duplicate words, space separated
:a;s/^(.*\n)(\S+)(\n.*\b\2\b)/\2 \1\3/;ta
s/\n.*//  # remove adulterated file leaving list of duplicates
G         # append original file to list
# remove lines with duplicate words
:b;s/^((\S+) .*)\n[^\n]*\2[^\n]*/\1/;tb
s/^\S+ //;tb # reduce duplicate word list
s/..//    # remove newline artefacts
!
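
Applied to the sample file, it prints the same three lines:

apple pear
spoon fork
sausage egg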
