
Relating two consecutive lines in a file

I have a txt file of repeating lines like this:

Host: http://de.wikipedia.org
Referer: http://www.wikipedia.org
Host: answers.yahoo.com/
Referer: http://www.yahoo.com
Host: http://de.wikipedia.org
Referer: http://www.wikipedia.org
Host: http://maps.yahoo.com/
Referer: http://www.yahoo.com
Host: http://pt.wikipedia.org
Referer: http://www.wikipedia.org
Host: answers.yahoo.com/
Referer: http://www.yahoo.com
Host: mail.yahoo.com
Referer: http://www.yahoo.com
Host: http://fr.wikipedia.org
Referer: http://www.wikipedia.org
Host: mail.yahoo.com
Referer: http://www.yahoo.com

I am trying to use this piece of code to go through the lines and count how many hosts have been accessed through the same referrer:

dd = {}
for line in open('hosts.txt'):
    if line.startswith('Host'):
        host = line.split(':')[1].strip('\n')
    elif line.startswith('Referer'):
        referer = line.split(': ')[1].strip('\n')
        dd.setdefault(referer, [0 , host])
        dd[referer][0] += 1
print dd

E.g., from wikipedia.org, how many links or domains have been accessed.

I want each referrer to appear only once, as a key, and for the hosts belonging to that referrer I want the number of distinct hosts. A host that has already been counted for the same referrer should be ignored when it appears again. The result should have the referrer as key and the number of unique hosts as value, as below:

{'http://www.wikipedia.org': 3, 'http://www.yahoo.com': 3}

The problem with my code is that it sums all the repeating hosts for the same referrer because I can't figure out how to relate the Host and Referer lines. So any hint or help is highly appreciated.

You could have a set for each referrer in the dictionary, rather than just a number. This way you could just add each host to the set, and duplicates will automatically be discarded. To get the number of hosts for the referrer, get the number of elements in the set.

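dd = {}
referrer = None

for line in open('hosts.txt'):
    if line.startswith('Host'):
        # Remember the host until its Referer line arrives
        host = line.split(': ')[1].strip('\n')
    elif line.startswith('Referer'):
        referrer = line.split(': ')[1].strip('\n')

    if referrer is not None:
        # Adding to a set silently drops hosts already seen for this referrer
        dd.setdefault(referrer, set()).add(host)
        referrer = None

# Print each referrer with its number of unique hosts
for k, v in dd.items():
    print(k, len(v))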
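
The same idea can be wrapped in a function that takes any iterable of lines, which makes it easy to try without a file. This is only a sketch in Python 3: the name count_unique_hosts and the shortened sample list are made up for illustration, and it assumes every Host line is followed by its Referer line, as in the file above.

```python
from collections import defaultdict


def count_unique_hosts(lines):
    """Map each referrer to the number of distinct hosts reached through it."""
    hosts_by_referrer = defaultdict(set)
    host = None
    for line in lines:
        key, sep, value = line.partition(': ')
        if not sep:
            continue  # skip blank or malformed lines
        value = value.strip()
        if key == 'Host':
            # Hold the host until the Referer line that follows it
            host = value
        elif key == 'Referer' and host is not None:
            # The set ignores a host already recorded for this referrer
            hosts_by_referrer[value].add(host)
            host = None
    return {ref: len(hosts) for ref, hosts in hosts_by_referrer.items()}


# Shortened, hypothetical sample in the same format as hosts.txt
sample = [
    'Host: http://de.wikipedia.org',
    'Referer: http://www.wikipedia.org',
    'Host: answers.yahoo.com/',
    'Referer: http://www.yahoo.com',
    'Host: http://de.wikipedia.org',
    'Referer: http://www.wikipedia.org',
    'Host: mail.yahoo.com',
    'Referer: http://www.yahoo.com',
]

print(count_unique_hosts(sample))
# {'http://www.wikipedia.org': 1, 'http://www.yahoo.com': 2}
```

For the real file, something like count_unique_hosts(open('hosts.txt')) should work, since the function strips the trailing newline from each value.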
