
A toy example of MapReduce

I'm new to Hadoop and Python, and I'm wondering how to improve my algorithm.

This is the problem (to be solved with MapReduce):

We will provide three datasets of different sizes, generated from Sina Weibo users' relationships. The small dataset contains 1,000 users, the medium one contains around 2.5 million users, and the large one contains 4.8 million users. Each user is represented by a unique ID number.

The format of the data file is as follows (followers are separated by spaces):

followee_1_id:follower_1_id follower_2_id follower_3_id ....
followee_2_id:follower_1_id follower_6_id follower_7_id ....
...

e.g.

A:B D 
B:A 
C:A B E 
E:A B C
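
To make the input format concrete, here is a minimal sketch (Python 3) of how one line could be parsed. The helper name `parse_line` is only an illustration and is not part of the assignment.

```python
def parse_line(line):
    """Split 'followee:f1 f2 f3' into (followee, [f1, f2, f3])."""
    # partition() splits on the first ':' only; split() drops extra spaces
    followee, _, followers = line.strip().partition(':')
    return followee, followers.split()


# the four example lines from above
sample = ["A:B D ", "B:A ", "C:A B E ", "E:A B C"]
for line in sample:
    print(parse_line(line))
```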

For the community detection output, we want to know the top K most similar persons for EVERY user. The output format should be (similar persons are separated by spaces):

User_1:Similar_Person_1 Similar_Person_2 ... Similar_Person_K
User_2:Similar_Person_1 Similar_Person_2 ... Similar_Person_K

(where K is 10,000)

My solution:
My algorithm maintains a list of at most 10,000 similar people. Whenever the list reaches 10,001 entries, I sort it and pop the last one. I found that when the dataset is large, this takes roughly (n - 10000) * n * log(n) time to execute. Any suggestions on how to improve it?
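
One possible way to avoid re-sorting the whole list on every overflow is a bounded min-heap. The sketch below (Python 3) is a swapped-in technique, not the code from this post. The function name `top_k` and the `(name, similarity)` tuple layout are assumptions made for illustration, and ties are broken by name simply because tuples compare that way.

```python
import heapq


def top_k(pairs, k):
    """Return the k (name, similarity) pairs with the highest similarity.

    A min-heap of size k keeps the weakest kept candidate at the root,
    so each new pair costs O(log k) instead of a full sort.
    """
    if k <= 0:
        return []
    heap = []  # entries are (similarity, name)
    for name, similarity in pairs:
        if len(heap) < k:
            heapq.heappush(heap, (similarity, name))
        elif (similarity, name) > heap[0]:
            # the new pair beats the weakest kept one, so swap them
            heapq.heapreplace(heap, (similarity, name))
    # sort only the survivors, most similar first
    return [(name, s) for s, name in sorted(heap, reverse=True)]


print(top_k([("B", 3), ("C", 1), ("D", 5), ("E", 2)], 2))
```

The standard library's `heapq.nlargest(k, iterable, key=...)` does much the same job in one call, if a hand-written loop is not needed.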

My observation:
After some rough calculation, I found that if the number of similar people is small, we should keep the buffer large. For example, if a person has 5,000 similar people, we can raise the upper limit of the list to as much as 100,000. Then we only need to sort the list once, i.e. before printing the result.
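
The observation above could be sketched like this (Python 3): let the list grow up to some larger limit, and only sort and trim when that limit is hit. The name `top_k_buffered` and the `limit` parameter are assumptions for illustration; 100,000 is just the figure from the example above.

```python
from operator import itemgetter


def top_k_buffered(pairs, k, limit):
    """Keep up to `limit` candidates, trimming to the best k when full."""
    if limit <= k:
        raise ValueError("limit must be larger than k")
    buffer = []
    for pair in pairs:  # each pair is (name, similarity)
        buffer.append(pair)
        if len(buffer) >= limit:
            # buffer is full: keep only the k most similar entries
            buffer.sort(key=itemgetter(1), reverse=True)
            del buffer[k:]
    # final sort; if the limit was never hit, this is the only sort
    buffer.sort(key=itemgetter(1), reverse=True)
    return buffer[:k]


pairs = [("u%d" % i, i) for i in range(50)]
print(top_k_buffered(pairs, 3, 10))
```

With k = 10,000 and limit = 100,000, a person with 5,000 similar people never triggers the trim, so the list is sorted exactly once.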

#!/usr/bin/env python3

from operator import itemgetter
import sys

TOP_K = 10000

def print_list_of_dict(list_of_dic):
    # print the names on one line, separated by spaces
    print(' '.join(v['name'] for v in list_of_dic))

def add_relation(ranking, name, similarity):
    # store the finished relation; when the list reaches TOP_K + 1 entries,
    # sort it and drop the least similar person
    ranking.append({'name': name, 'similarity': similarity})
    if len(ranking) == TOP_K + 1:
        ranking.sort(key=itemgetter('similarity'), reverse=True)
        ranking.pop()

def print_ranking(person, ranking):
    # sort by similarity, highest first, and print "person:name name ..."
    ranking.sort(key=itemgetter('similarity'), reverse=True)
    print(person + ':', end='')
    print_list_of_dict(ranking)

current_person1 = None
current_person2 = None
current_S = 0
# list of dictionaries, one per similar person
ranking = []

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    if not line:
        continue

    # parse the input we got from mapper.py
    person1, person2 = line.split()

    # first person, first relation
    if current_person1 is None:
        current_person1 = person1
        current_person2 = person2
        current_S = 1
    # same person, same relation
    elif current_person1 == person1 and current_person2 == person2:
        current_S += 1
    # same person, different relation
    elif current_person1 == person1:
        add_relation(ranking, current_person2, current_S)
        current_person2 = person2
        current_S = 1
    # different person
    else:
        add_relation(ranking, current_person2, current_S)
        print_ranking(current_person1, ranking)
        # start a new ranking for the next person
        ranking = []
        current_person1 = person1
        current_person2 = person2
        current_S = 1

# add the last relation and print the last person's ranking
if current_person1 is not None:
    add_relation(ranking, current_person2, current_S)
    print_ranking(current_person1, ranking)

Solved: store everything in memory, sort only once, and then print just the top 10,000.
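
A minimal sketch of what that could look like (Python 3), assuming the reducer input is already grouped by the first person, as Hadoop streaming's sort-by-key would provide. The function name `rank_all` is made up for illustration, and the actual solution may differ.

```python
from collections import Counter
from itertools import groupby


def rank_all(lines, k):
    """Yield 'person:sim_1 sim_2 ...' for each person in grouped input."""
    pairs = (line.split() for line in lines if line.strip())
    for person, group in groupby(pairs, key=lambda p: p[0]):
        # hold every relation of this person in memory and count repeats
        counts = Counter(other for _, other in group)
        # a single sort per person, most similar first
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        yield person + ':' + ' '.join(name for name, _ in ranked[:k])


for row in rank_all(["A B", "A B", "A C", "D A"], 10000):
    print(row)
```

In a real reducer, `rank_all(sys.stdin, 10000)` would take the place of the hard-coded sample list.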
