Create unique ids for a group

I am working on a problem where I have to group related items and assign each group a unique id. I have written the code in Python, but it does not return the expected output. I need help refining my logic. The code is below:

data = {}
child_list = []


for index, row in df.iterrows():
    parent = row['source']
    child = row['target']
    #print 'Parent: ', parent
    #print 'Child:', child
    child_list.append(child)
    #print child_list
    if parent not in data.keys():
        data[parent] = []
    if parent != child:
        data[parent].append(child)
    #print data

op = {}
gid = 0


def recursive(op,x,gid):
    if x in data.keys() and data[x] != []:
        for x_child in data[x]:
            if x_child in data.keys():
                op[x_child] = gid
                recursive(op,x_child,gid)
            else:
                op[x] = gid
    else:
        op[x] = gid


for key in data.keys():
    #print "Key: ", key
    if key not in child_list:
        gid = gid + 1
        op[key] = gid
        for x in data[key]:
            op[x] = gid
            recursive(op,x,gid)

related = pd.DataFrame({'items':op.keys(),
                  'uniq_group_id': op.values()})
related.sort_values('items')

Example below:

Input:
source  target
a        b
b        c
c        c
c        d
d        d
e        f
a        d
h        a
i        f  

Desired Output:
item     uniq_group_id
a         1 
b         1
c         1
d         1
h         1
e         2
f         2
i         2

My code gave me the output below, which is wrong.

item    uniq_group_id
a       3
b       3
c       3
d       3
e       1
f       2
h       3
i       2 

Another example:

Input:
df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
                'target':['b','c','c','d','d','f','d','a','f','a']})
Desired Output:
item    uniq_group_id
a       1
b       1
c       1
d       1
e       2
f       2

My code Output:
item    uniq_group_id
e   1
f   1

The order of the rows and the value of the group id do not matter. The important thing is that related items get the same unique identifier. The whole problem is to find groups of related items and assign each group a unique group id.

I haven't analyzed your code closely, but it looks like the error comes from the way you populate the data dictionary. It stores a child node as a neighbor of its parent node, but it also needs to store the parent as a neighbor of the child.
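To make the asymmetry concrete, here is a minimal sketch using the edges from the first example in the question. The names edges, directed and undirected are illustrative only and do not appear in the original code. With parent-to-child links alone, a has no link back to h, so a traversal starting at a can never reach h.

```python
from collections import defaultdict

# Edges from the first example in the question, as (source, target) pairs.
edges = [('a', 'b'), ('b', 'c'), ('c', 'c'), ('c', 'd'), ('d', 'd'),
         ('e', 'f'), ('a', 'd'), ('h', 'a'), ('i', 'f')]

directed = defaultdict(set)    # parent -> children only, as in the question
undirected = defaultdict(set)  # every link stored in both directions

for u, v in edges:
    directed[u].add(v)
    undirected[u].add(v)
    undirected[v].add(u)

print(sorted(directed['a']))    # ['b', 'd']
print(sorted(undirected['a']))  # ['b', 'd', 'h']
```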

Rather than attempting to fix your code, I decided to adapt this pseudocode written by Aseem Goyal. The code below takes its input data from simple Python lists, but it should be easy to adapt it to work with a pandas DataFrame.

''' Find all the connected components of an undirected graph '''

from collections import defaultdict

src = ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a']
tgt = ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']

nodes = sorted(set(src + tgt))
print('Nodes', nodes)

neighbors = defaultdict(set)
for u, v in zip(src, tgt):
    neighbors[u].add(v)
    neighbors[v].add(u)

print('Neighbors')
for n in nodes:
    print(n, neighbors[n])

visited = {}
def depth_first_traverse(node, group_id):
    for n in neighbors[node]:
        if n not in visited:
            visited[n] = group_id
            depth_first_traverse(n, group_id)

print('Groups')
group_id = 1
for n in nodes:
    if n not in visited:
        visited[n] = group_id
        depth_first_traverse(n, group_id)
        group_id += 1
    print(n, visited[n])

Output:

Nodes ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i']
Neighbors
a {'a', 'd', 'b', 'h'}
b {'a', 'c'}
c {'d', 'b', 'c'}
d {'d', 'a', 'c'}
e {'f'}
f {'i', 'e'}
h {'a'}
i {'f'}
Groups
a 1
b 1
c 1
d 1
e 2
f 2
h 1
i 2

This code was written for Python 3, but it will also run on Python 2. If you do run it on Python 2, you should add from __future__ import print_function at the top of your import statements; it's not strictly necessary, but it will make the output look nicer.
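As a rough illustration of the DataFrame adaptation mentioned above, the sketch below reads the edges from the source and target columns of the question's second example and produces a table with item and uniq_group_id columns. The function name assign_groups and the dictionary group_of are assumptions made for this example. It also swaps the recursive traversal for an explicit stack, which is a design choice here (it avoids Python's recursion limit on long chains), not part of the original answer.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'source': ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a'],
                   'target': ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']})

# Build an undirected adjacency map from the two columns.
neighbors = defaultdict(set)
for u, v in zip(df['source'], df['target']):
    neighbors[u].add(v)
    neighbors[v].add(u)


def assign_groups(neighbors):
    """Label each node with the id of its connected component."""
    group_of = {}
    group_id = 0
    for start in sorted(neighbors):
        if start in group_of:
            continue
        group_id += 1
        group_of[start] = group_id
        stack = [start]  # explicit stack instead of recursion
        while stack:
            node = stack.pop()
            for n in neighbors[node]:
                if n not in group_of:
                    group_of[n] = group_id
                    stack.append(n)
    return group_of


group_of = assign_groups(neighbors)
related = pd.DataFrame({'item': list(group_of.keys()),
                        'uniq_group_id': list(group_of.values())})
related = related.sort_values('item').reset_index(drop=True)
print(related)
```

Because the starting nodes are visited in sorted order, the group containing a is numbered 1 and the group containing e is numbered 2.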

You can use the Union-Find (or Disjoint-Sets) algorithm for this. See this related answer for a more complete explanation. Basically, you need two functions, union and find, to create a tree (i.e. a nested dictionary) of leaders or predecessors:

import collections

leaders = collections.defaultdict(lambda: None)

def find(x):
    l = leaders[x]
    if l is not None:
        l = find(l)
        leaders[x] = l
        return l
    return x

def union(x, y):
    lx, ly = find(x), find(y)
    if lx != ly:
        leaders[lx] = ly

You can apply this to your problem as follows:

import pandas as pd

df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
                   'target': ['b','c','c','d','d','f','d','a','f','a']})

# build the tree
for _, row in df.iterrows():
    union(row["source"], row["target"])

# build groups based on leaders
groups = collections.defaultdict(set)
for x in leaders:
    groups[find(x)].add(x)
for num, group in enumerate(groups.values(), start=1):
    print(num, group)

Result:

1 {'e', 'f', 'i'}
2 {'h', 'a', 'c', 'd', 'b'}
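To get from these groups to the item / uniq_group_id table the question asks for, something like the following sketch could work. It restates find and union so that the block runs on its own, and it differs from the code above in two ways that are assumptions of this example: leaders is a plain dict in which each root points to itself, and find is iterative. The names root_ids, rows and related are also illustrative.

```python
import pandas as pd

df = pd.DataFrame({'source': ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a'],
                   'target': ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']})

leaders = {}  # item -> its current leader; a root points to itself


def find(x):
    """Return the root leader of x, compressing the path on the way."""
    leaders.setdefault(x, x)
    root = x
    while leaders[root] != root:
        root = leaders[root]
    # Second pass: point every node on the path directly at the root.
    while leaders[x] != root:
        leaders[x], x = root, leaders[x]
    return root


def union(x, y):
    """Merge the groups containing x and y."""
    leaders[find(x)] = find(y)


for s, t in zip(df['source'], df['target']):
    union(s, t)

# Number the roots 1, 2, ... in the order their first member appears.
root_ids = {}
rows = []
for item in sorted(leaders):
    gid = root_ids.setdefault(find(item), len(root_ids) + 1)
    rows.append((item, gid))

related = pd.DataFrame(rows, columns=['item', 'uniq_group_id'])
print(related)
```

Items are processed in sorted order, so the group containing a gets id 1 and the group containing e gets id 2. Since the question states that the particular ids do not matter, any other consistent numbering would serve equally well.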
