A toy example of mapreduce

I'm new to Hadoop and Python, and I would like to know how to improve my algorithm.

This is the problem (to be solved using the MapReduce model):

We will provide three datasets of different sizes, generated from the relationships between Sina Weibo users. The small dataset contains 1,000 users, the medium one around 2.5 million users, and the large one 4.8 million users. Each user is represented by a unique ID number.

The format of the data file is as follows (followers are separated by spaces):

followee_1_id:follower_1_id follower_2_id follower_3_id ....
followee_2_id:follower_1_id follower_6_id follower_7_id ....
...

For example:

A:B D 
B:A 
C:A B E 
E:A B C
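
As a minimal sketch of reading this format in Python (the helper name `parse_line` is made up for illustration and is not part of the assignment), each line can be split at the colon and the followers split on whitespace:

```python
def parse_line(line):
    """Split 'followee:follower follower ...' into (followee, [followers])."""
    followee, _, followers = line.strip().partition(':')
    return followee, followers.split()


# The sample data from the example above.
sample = ["A:B D", "B:A", "C:A B E", "E:A B C"]
for followee, followers in map(parse_line, sample):
    print(followee, followers)
```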

The output of the community detection is, for EVERY user, the TOP K most similar people. The output format should be (similar people are separated by spaces):

User_1:Similar_Person_1 Similar_Person_2 ... Similar_Person_K
User_2:Similar_Person_1 Similar_Person_2 ... Similar_Person_K

(where K is 10,000)

My solution:
My algorithm maintains a list of at most 10,000 similar people and sorts the list whenever it reaches 10,001 entries, then pops the last one. I found that when the dataset is large, this takes roughly (n-10000) * n * log(n) time to execute. Any suggestions on how to improve it?
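
One possible alternative to re-sorting the whole list on every overflow, offered as a sketch and not as what the code in this post does, is a bounded min-heap built with the standard `heapq` module. Each insertion then costs O(log K) instead of a full sort. The helper names `push_bounded` and `ranked_names` are hypothetical, and ties are broken by name here, which differs from the stable sort used in the post.

```python
import heapq


def push_bounded(heap, name, similarity, k):
    """Keep only the k highest-similarity entries in a min-heap."""
    entry = (similarity, name)
    if len(heap) < k:
        heapq.heappush(heap, entry)
    elif entry > heap[0]:
        # The new entry beats the current weakest one, so swap them.
        heapq.heapreplace(heap, entry)


def ranked_names(heap):
    """Return the names ordered from most to least similar."""
    return [name for similarity, name in sorted(heap, reverse=True)]


heap = []
for name, similarity in [("B", 3), ("C", 1), ("D", 5), ("E", 2)]:
    push_bounded(heap, name, similarity, k=2)
print(ranked_names(heap))  # ['D', 'B']
```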

My observation:
After some rough calculation, I found that if the number of similar people is small, we should keep the buffer large. For example, if a person has 5,000 similar people, we can raise the upper limit of the list to as much as 100,000. Then we only need to sort the list once, i.e. before printing the result.
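
The observation above could be sketched roughly as follows: let the list grow up to a larger limit, and only sort and cut it back to K entries when that limit is exceeded. The names `add_buffered`, `finish`, and `BUFFER_LIMIT` are assumptions made for this illustration; the value 100,000 is taken from the example in the text.

```python
from operator import itemgetter

TOP_K = 10000          # entries to keep, as in the question
BUFFER_LIMIT = 100000  # hypothetical upper limit, from the example in the text


def add_buffered(ranking, name, similarity, k=TOP_K, limit=BUFFER_LIMIT):
    """Append an entry; sort and cut back to k only when the buffer overflows."""
    ranking.append((name, similarity))
    if len(ranking) > limit:
        ranking.sort(key=itemgetter(1), reverse=True)
        del ranking[k:]


def finish(ranking, k=TOP_K):
    """Final sort before printing; returns at most k names."""
    ranking.sort(key=itemgetter(1), reverse=True)
    return [name for name, _ in ranking[:k]]
```

If a person has fewer similar people than the limit, `add_buffered` never sorts, and the only sort happens in `finish`.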

#!/usr/bin/env python3
"""Reducer: reads sorted "person1 person2" pairs from STDIN and prints,
for each person1, the people most similar to them."""

from operator import itemgetter
import sys

TOP_K = 10000


def print_ranking(person, ranking):
    # Prints "person : name1 name2 ..." on one line.
    print(person, ':', ' '.join(entry['name'] for entry in ranking))


def add_to_ranking(ranking, name, similarity):
    # Append the finished pair; once the list holds TOP_K + 1 entries,
    # sort it and drop the least similar one.
    ranking.append({'name': name, 'similarity': similarity})
    if len(ranking) == TOP_K + 1:
        ranking.sort(key=itemgetter('similarity'), reverse=True)
        ranking.pop()


current_person1 = None
current_person2 = None
current_S = 0
# list of {'name': ..., 'similarity': ...} dictionaries
ranking = []

# input comes from STDIN
for line in sys.stdin:
    # parse the input we got from mapper.py
    person1, person2 = line.split()

    if current_person1 is None:
        # first person, first relation
        current_person1, current_person2, current_S = person1, person2, 1
    elif current_person1 == person1 and current_person2 == person2:
        # same person, same relation
        current_S += 1
    elif current_person1 == person1:
        # same person, different relation
        add_to_ranking(ranking, current_person2, current_S)
        current_person2, current_S = person2, 1
    else:
        # different person: finish and print the previous person's ranking
        add_to_ranking(ranking, current_person2, current_S)
        ranking.sort(key=itemgetter('similarity'), reverse=True)
        print_ranking(current_person1, ranking)
        ranking = []
        current_person1, current_person2, current_S = person1, person2, 1

# add the last relation and print the last person (skipped on empty input)
if current_person1 is not None:
    add_to_ranking(ranking, current_person2, current_S)
    ranking.sort(key=itemgetter('similarity'), reverse=True)
    print_ranking(current_person1, ranking)

Solved: store everything in memory, sort only once, and then print the top 10,000.
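
A rough sketch of that idea for a single user, assuming all of that user's candidates fit in memory: count every candidate first, then rank once. The function name `top_k_names` and its `candidates` argument (the person2 values seen for one person1) are hypothetical; `collections.Counter` is from the standard library.

```python
from collections import Counter


def top_k_names(candidates, k):
    """Count how often each candidate appears, then rank once."""
    counts = Counter(candidates)
    # most_common(k) returns the k entries with the highest counts, highest first.
    return [name for name, _ in counts.most_common(k)]


print(top_k_names(["B", "C", "B", "E", "B", "C"], k=2))  # ['B', 'C']
```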
