mapreduce的玩具示例

Question

我是hadoop和python的新手。 我想知道如何改进算法。

这是问题所在：（使用mapreduce结构解决）

我们将提供来自新浪微博用户关系的三个大小不同的数据集。 较小的数据集包含1000个用户，中等的数据集包含约250万用户，而较大的数据集包含480万用户。 每个用户都用其唯一的ID号表示。

数据文件的格式如下（不同的跟随者用空格分隔） ：

followee_1_id:follower_1_id follower_2_id follower_3_id ....
followee_2_id:follower_1_id follower_6_id follower_7_id .... ...

例如。

A:B D 
B:A 
C:A B E 
E:A B C

社区检测的结果是，对于每个用户，我们都想了解TOP K最相似的人。 输出格式应为（不同的相似人物之间用空格隔开） ：

User_1:Similiar_Person_1 Similiar_Person_2 ... Similiar_Person_K 
User_2:Similiar_Person_1 Similiar_Person_2 ... Similiar_Person_K

（其中K表示10,000）

我的解决方案：
我的算法是维护一个最多10,000个相似人员的列表，并在相似人员达到10001时对列表进行排序。 然后弹出最后一个。 之后，我发现数据集很大时，大约需要执行(n-10000).n.log(n)时间，有关如何改进它的任何建议？

我的观察：
经过一些粗略的计算，我发现如果相似的人很小，我们应该保持较大的缓冲区。 例如，如果一个人有5000个相似的人，那么我们可以将列表的上限设为100,000。 然后，我们只需要对列表排序一次，即在打印结果之前。

#!/usr/bin/env python

from operator import itemgetter
import sys

def print_list_of_dict(list_of_dic):
    for v in list_of_dic:
        print v['name'],
    print 
return

current_person1 = None
current_person2 = None
current_S = 0
#declare a list of dictionary
ranking = []
d = {}
flag = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    person1, person2 = line.split()

    # first person first relation
    if not current_person1:
        current_person1 = person1
        current_person2 = person2
        current_S += 1
    else:
        # same person , same relation
        if current_person1 == person1 and current_person2 == person2:
            current_S += 1
            flag = 0
        # same person , different relation
        elif current_person1 == person1 and current_person2 != person2:
            d['name'] = current_person2
            d['similarity'] = current_S
            ranking.append(d.copy())
            if len(ranking) == 10001:
                ranking = sorted(ranking,key=itemgetter('similarity'),reverse = True)
                ranking.pop()
            current_person2 = person2
            current_S = 1
            flag = 1
        # different person
        else:
            d['name'] = current_person2
            d['similarity'] = current_S
            ranking.append(d.copy())
            if len(ranking) == 10001:
                ranking = sorted(ranking,key=itemgetter('similarity'),reverse = True)
                ranking.pop()
            ranking = sorted(ranking,key=itemgetter('similarity'),reverse = True)
            print current_person1,':',
            print_list_of_dict(ranking)
            # a new dictionary
            ranking = [] 
            current_person1 = person1
            current_person2 = person2
            current_S = 1
            flag = 2
# add and print the last relation to dictionary
d['name'] = current_person2
d['similarity'] = current_S
ranking.append(d.copy())
if len(ranking) == 10001:
    ranking = sorted(ranking,key=itemgetter('similarity'),reverse = True)
    ranking.pop()
ranking = sorted(ranking,key=itemgetter('similarity'),reverse = True)
print current_person1,':',
print_list_of_dict(ranking)

Answer 1

解决后，将所有内容存储在内存中，并且仅在排序一次后打印前10000个。

mapreduce的玩具示例

问题描述

1 个解决方案

解决方案1
0 已采纳 2016-02-24 13:50:06

mapreduce的玩具示例

问题描述

1 个解决方案

解决方案1 0 已采纳 2016-02-24 13:50:06

解决方案1
0 已采纳 2016-02-24 13:50:06