
Python: creating undirected weighted graph from a co-occurrence matrix

I am using Python 2.7 to build a project that collects Twitter data and analyzes it. The main idea is to collect tweets, find the most common hashtags in that collection, and then build a graph in which the hashtags are nodes. If two hashtags appear in the same tweet, that is an edge in the graph, and the weight of that edge is the number of times they co-occur. So I am trying to build a dictionary of dictionaries using defaultdict(lambda : defaultdict(int)) and then create the graph with networkx.from_dict_of_dicts.

My code for creating the co-occurrence matrix is:

from collections import defaultdict


def coocurrence(common_entities):

    com = defaultdict(lambda: defaultdict(int))

    # Build co-occurrence matrix
    for i in range(len(common_entities) - 1):
        for j in range(i + 1, len(common_entities)):
            w1, w2 = sorted([common_entities[i], common_entities[j]])
            if w1 != w2:
                com[w1][w2] += 1

    return com

But in order to use networkx.from_dict_of_dicts I need it to be in this format: com = {0: {1: {'weight': 1}}}

Do you have any ideas how I can solve this? Or is there a different way of creating a graph like this?

First of all, I would sort the entities once up front, so you are not repeatedly running sort inside the loop. Then I would use itertools.combinations to generate the pairs. The straightforward translation of what you need with those changes is this:

from itertools import combinations
from collections import defaultdict


def coocurrence (common_entities):

    com = defaultdict(lambda : defaultdict(lambda: {'weight':0}))

    # Build co-occurrence matrix
    for w1, w2 in combinations(sorted(common_entities), 2):
        if w1 != w2:
            com[w1][w2]['weight'] += 1

    return com

print(coocurrence('abcaqwvv'))
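
To see why sorting once is enough, here is a minimal sketch comparing the two ways of producing pairs. The hashtag list is made up for illustration, and the variable names are not from the original code. It only checks that both approaches yield the same pairs, each already in sorted order.

```python
from itertools import combinations

# Hypothetical hashtags from a single tweet (illustrative data only).
entities = ['python', 'data', 'ai', 'data']

# Question's approach: sort every pair inside the nested index loops.
pairs_sorted_in_loop = []
for i in range(len(entities) - 1):
    for j in range(i + 1, len(entities)):
        pairs_sorted_in_loop.append(tuple(sorted([entities[i], entities[j]])))

# Answer's approach: sort once, then let combinations() produce the pairs.
pairs_sorted_once = list(combinations(sorted(entities), 2))

# Both contain the same pairs, though not necessarily in the same order.
print(sorted(pairs_sorted_in_loop) == sorted(pairs_sorted_once))
```

Because the input to combinations is already sorted, every pair it emits has w1 <= w2, so no per-pair sort is needed.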

It may be more efficient (less indexing and fewer objects created) to build a flat dictionary keyed by pair first and then generate the final answer in a second loop. The second loop runs for fewer iterations than the first, because it visits each distinct pair only once, after all the counts have been calculated. For the same reason, deferring the if statement to the second loop may save more time. As usual, run timeit on multiple variations if you care, but here is one possible example of the two-loop solution:

def coocurrence (common_entities):

    com = defaultdict(int)

    # Build co-occurrence matrix
    for w1, w2 in combinations(sorted(common_entities), 2):
        com[w1, w2] += 1

    result = defaultdict(dict)
    for (w1, w2), count in com.items():
        if w1 != w2:
            result[w1][w2] = {'weight': count}
    return result

print(coocurrence('abcaqwvv'))

This is the working code, and the best of the versions here. It accepts any number of entity lists and counts each pair in both directions:

from collections import defaultdict
from itertools import combinations


def coocurrence(*inputs):
    com = defaultdict(int)

    for named_entities in inputs:
        # Build co-occurrence matrix
        for w1, w2 in combinations(sorted(named_entities), 2):
            com[w1, w2] += 1
            com[w2, w1] += 1  # Including both directions

    result = defaultdict(dict)
    for (w1, w2), count in com.items():
        if w1 != w2:
            result[w1][w2] = {'weight': count}
    return result
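
A short usage sketch for this version, with one list of hashtags per tweet. The tweets and the names tweets and adjacency are invented for illustration, and the function is repeated so the snippet runs on its own.

```python
from collections import defaultdict
from itertools import combinations


def coocurrence(*inputs):
    # Same logic as the answer above, repeated so this sketch is self-contained.
    com = defaultdict(int)
    for named_entities in inputs:
        for w1, w2 in combinations(sorted(named_entities), 2):
            com[w1, w2] += 1
            com[w2, w1] += 1
    result = defaultdict(dict)
    for (w1, w2), count in com.items():
        if w1 != w2:
            result[w1][w2] = {'weight': count}
    return result


# Hypothetical hashtag lists, one per tweet.
tweets = [
    ['python', 'data'],
    ['python', 'data', 'ai'],
    ['ai', 'python'],
]

adjacency = coocurrence(*tweets)

# Each pair is stored under both hashtags with the same weight.
print(adjacency['python']['data'])  # {'weight': 2}
print(adjacency['data']['python'])  # {'weight': 2}
print(adjacency['ai']['data'])      # {'weight': 1}
```

The resulting adjacency dictionary has the {node: {neighbor: {'weight': n}}} shape the question asks for, so it should be usable as the input to networkx.from_dict_of_dicts.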
