
Remove non-duplicated lines in a text file

I have a long list (100k+) of IP addresses in a certain range. An example from this script is:

67.0.105.76 0
67.0.123.150 0
67.0.123.150 0
67.0.123.150 0
67.0.123.150 0
67.0.123.150 0
67.0.123.150 0
67.0.123.150 0
67.0.123.150 0
67.0.105.76 0
67.0.123.150 0
67.0.163.127 0
67.0.123.150 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.105.76 0
67.0.105.76 0
67.0.105.76 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.105.76 0
67.0.143.13 0

From this list I want to remove any IPs that are not listed multiple times. For example, I would like to remove all IPs from the list above that aren't listed 5 or more times, so that it would output:

67.0.105.76 0
67.0.123.150 0
67.0.163.127 0
67.0.232.158 0

I've tried to accomplish this using sed/uniq in Linux but wasn't able to find a way to do it. Would a Python script or something similar be needed for this, or is there a way to do it using sed/uniq?

Using sort -u 100kfile removed all duplicates, but it still left the single IPs.

Using sort, uniq and awk (sort groups identical lines so that uniq -c can prefix each unique line with its count, and awk then keeps only the lines seen more than 4 times and prints the IP and the trailing field):

pu@pumbair: ~  sort data.txt | uniq -c | awk '{if ($1 > 4) print $2,$3}'
67.0.105.76 0
67.0.123.150 0
67.0.163.127 0
67.0.232.158 0

Pure Python solution, using the Counter tool from the collections module.

I have no idea how this will do with 100k addresses, but you could give it a go.

from collections import Counter

with open('ip_file.txt', 'r') as f:
    # Strip the newline from every line and count how often each one appears
    ip_list     = [line.strip() for line in f]
    ip_by_count = Counter(ip_list)

    for ip in ip_by_count:
        if ip_by_count[ip] > 1:
            print(ip)
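
To match the question's "5 or more times" requirement, the same Counter can be filtered with a higher cutoff instead of > 1 (a minimal sketch, assuming the same ip_file.txt as above):

from collections import Counter

# Count stripped lines and keep only the ones that appear at least 5 times
with open('ip_file.txt', 'r') as f:
    counts = Counter(line.strip() for line in f)

for ip, n in counts.items():
    if n >= 5:
        print(ip)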

Or an alternative approach: maintain two sets, one of IPs seen exactly once, and one for IPs seen at least twice. Print an IP when we see it for a second time, and skip all subsequent appearances:

known_dupes = set()
single_ips  = set()

with open('ip_file.txt', 'r') as f:
    ip_list = [line.strip() for line in f]

    for ip in ip_list:
        if ip in known_dupes:
            # Already printed this IP, skip any further appearances
            continue
        elif ip in single_ips:
            # Second time we've seen this IP: print it once
            print(ip)
            known_dupes.add(ip)
            single_ips.remove(ip)
        else:
            # First time we've seen this IP
            single_ips.add(ip)

I suspect the first is probably faster, but I haven't tried it on a large file to check.
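
If you want to compare them yourself, here is a rough timing sketch (the function names are just placeholders wrapping the two snippets above, and ip_file.txt stands in for your real file):

import time
from collections import Counter

def count_with_counter(path):
    # First approach: count every line, then collect the duplicated ones
    with open(path) as f:
        counts = Counter(line.strip() for line in f)
    return [ip for ip, n in counts.items() if n > 1]

def count_with_sets(path):
    # Second approach: report an IP the second time it is seen
    known_dupes, single_ips, dupes = set(), set(), []
    with open(path) as f:
        for line in f:
            ip = line.strip()
            if ip in known_dupes:
                continue
            elif ip in single_ips:
                dupes.append(ip)
                known_dupes.add(ip)
                single_ips.remove(ip)
            else:
                single_ips.add(ip)
    return dupes

for fn in (count_with_counter, count_with_sets):
    start = time.perf_counter()
    fn('ip_file.txt')
    print(fn.__name__, time.perf_counter() - start)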

Here is a simple way to do it in awk:

awk '{a[$0]++} END {for (i in a) if (a[i]>4) print i}' file
67.0.232.158 0
67.0.105.76 0
67.0.163.127 0
67.0.123.150 0

Count every unique IP and store the number of occurrences in the array a.
If there are more than 4 hits, print it.
It should be faster than the sort | uniq | awk pipeline.
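
For comparison, the same idea written out in Python with a plain dict (a rough sketch; like awk's $0, the whole line is used as the key, and 'file' stands in for your input file):

# Mirror of the awk one-liner: count whole lines, print those seen more than 4 times
counts = {}
with open('file') as f:
    for line in f:
        key = line.rstrip('\n')   # the whole line, like awk's $0
        counts[key] = counts.get(key, 0) + 1

for key, n in counts.items():
    if n > 4:
        print(key)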

PS: I saw after posting this that it's the same as what jaypal posted in a comment.
