Pandas Dataframe: complex problem of selecting rows that are linked to each other

Question

I have a dataframe df:

Col1   Col2
A      B
C      A
B      D
E      F
G      D
G      H
K      J

and a Series id of IDs:

ID
A
F
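
For anyone who wants to run the snippets on this page, the two inputs above could be rebuilt roughly as follows. This is only a sketch: the name `ID` given to the Series is an assumption, and `id` is kept as the variable name because the question uses it, even though it shadows the Python built-in.

```python
import pandas as pd

# Edge list from the question: each row links the letter in Col1 to the letter in Col2
df = pd.DataFrame({
    'Col1': ['A', 'C', 'B', 'E', 'G', 'G', 'K'],
    'Col2': ['B', 'A', 'D', 'F', 'D', 'H', 'J'],
})

# IDs whose links should be collected (the Series name 'ID' is an assumption)
id = pd.Series(['A', 'F'], name='ID')
```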

For every letter in id, I want to select the other letters that are linked to it through at most 2 intermediates. Let's take A as an example (it is much easier to understand that way):

There are 2 lines including A, linked to B and C, so the direct links to A are [B, C]. (It does not matter whether A is in Col1 or Col2.)

A      B
C      A

But B is also linked to D, and D is linked to G:

B      D
G      D

So the links to A are [B, C, D, G]. Even though G and H are linked, reaching H would take more than 2 intermediates from A (A > B > D > G > H, with B, D and G as intermediates), so I don't include H in A's links list.

G      H

I'm looking for a way to find the links list for every ID in id and save it in id:

ID   LinksList
A    [B, C, D, G]
F    [E]

I don't mind the type of LinksList (it can be a string) as long as I can get the info for a specific ID and work with it. I also don't mind the order of IDs in LinksList, as long as it's complete.

I already found a way to solve the problem, but it uses 3 for loops, so it takes a really long time. (For k1 in ID, for k2 in range(0, 3), select the direct links for each element of LinksList plus the starting element, and put them in LinksList if they're not already in.) Can someone please help me do it using only Pandas? Thanks a lot in advance!

==== EDIT: Here are the "3 loops", after Karl's comment: ====

import pandas as pd

rows = []
for k in id:
    # direct links of k, whichever column k appears in
    linklist = list(df[df['Col1'] == k]['Col2']) + list(df[df['Col2'] == k]['Col1'])
    new = linklist.copy()
    intermediate_count = 1
    while len(new) > 0 and intermediate_count <= 2:
        nn = new.copy()
        new = []
        for n in nn:
            toadd = list(df[df['Col1'] == n]['Col2']) + list(df[df['Col2'] == n]['Col1'])
            # keep only letters not collected yet, excluding the starting ID
            toadd = list(set(toadd).difference(linklist + [k]))
            linklist = linklist + toadd
            new = new + toadd
        intermediate_count += 1
    rows.append({'Id': k, 'Linked': linklist})

# one row per ID, with its list of linked letters
df_result = pd.DataFrame(rows)

Answer 1

I would first append the reciprocal of the dataframe so that I can always go from Col1 to Col2. Then I would use merges to compute the possible results with 1 and 2 intermediate steps. Finally, I would aggregate all those values into sets. The code could be:

import pandas as pd

# append the symmetric edges (Col2 -> Col1) to the end of the dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
df2 = pd.concat(
    [df, df.rename(columns={'Col1': 'Col2', 'Col2': 'Col1'})],
    ignore_index=True).drop_duplicates()

# add one step on Col3
df3 = df2.merge(df2, 'left', left_on='Col2', right_on='Col1',
                suffixes=('', '_')).drop(columns='Col1_').rename(
                    columns={'Col2_': 'Col3'})

# add a second step on Col4
df4 = df3.merge(df2, 'left', left_on='Col3', right_on='Col1',
                suffixes=('', '_')).drop(columns='Col1_').rename(
                    columns={'Col2_': 'Col4'})

# aggregate Col2 to Col4 into a set
df4['Links'] = df4.iloc[:, 1:].agg(set, axis=1)

# aggregate that new column grouped by Col1
result = df4.groupby('Col1')['Links'].agg(lambda x: set.union(*x)).reset_index()

# remove the initial value if present in Links
result['Links'] = result['Links'] - result['Col1'].apply(set)

# and display the result restricted to id
print(result[result['Col1'].isin(id)])

With the sample data, it gives the expected result:

  Col1         Links
0    A  {D, C, B, G}
5    F           {E}
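
The merge chain above hard-codes the two intermediate steps. As a rough sketch of the same idea extended to any number of hops, the merge can be repeated in a loop while keeping track of the pairs already reached. The helper name `links_within`, the `max_hops` parameter and the intermediate column names `src`, `dst`, `start` and `node` are all hypothetical; the input columns are assumed to be named Col1 and Col2 as in the question.

```python
import pandas as pd

def links_within(df, ids, max_hops=3):
    """Hypothetical helper: for each ID, list the nodes reachable in at most max_hops edges."""
    ids = list(ids)
    # Symmetric edge list so that every link can be followed from 'src' to 'dst'
    edges = pd.concat([
        df.rename(columns={'Col1': 'src', 'Col2': 'dst'}),
        df.rename(columns={'Col1': 'dst', 'Col2': 'src'}),
    ], ignore_index=True).drop_duplicates()

    # frontier holds the (start, node) pairs reached during the latest hop
    frontier = pd.DataFrame({'start': ids, 'node': ids})
    seen = frontier.copy()
    for _ in range(max_hops):
        # follow one more edge from every node on the frontier
        step = frontier.merge(edges, left_on='node', right_on='src')[['start', 'dst']]
        step = step.rename(columns={'dst': 'node'}).drop_duplicates()
        # keep only the pairs that were not reached in an earlier hop
        merged = step.merge(seen, on=['start', 'node'], how='left', indicator=True)
        frontier = merged.loc[merged['_merge'] == 'left_only', ['start', 'node']]
        seen = pd.concat([seen, frontier], ignore_index=True)

    # drop the (ID, ID) pairs so that an ID is not listed as its own link
    found = seen[seen['start'] != seen['node']]
    return {i: sorted(found.loc[found['start'] == i, 'node']) for i in ids}
```

With the question's data, `links_within(df, id)` uses `max_hops=3`, which corresponds to at most 2 intermediates. The result is a plain dict keyed by ID, which can be turned back into a Series with `pd.Series(...)` if needed.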

Answer 2

We can use the NetworkX library:

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt

# Read in pandas dataframe using copy and paste
df = pd.read_clipboard()

# Create graph network from pandas dataframe
G = nx.from_pandas_edgelist(df, 'Col1', 'Col2')

# Create the id Series
id = pd.Series(['A', 'F'])

# Use the values as the index of the Series
id.index=id

# For each value in id, use nx's `single_source_shortest_path` with cutoff 3 (at most 2 intermediates)
id.apply(lambda x: list(nx.single_source_shortest_path(G, x, 3).keys())[1:])

Output:

A    [B, C, D, G]
F             [E]
dtype: object
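
To see why a cutoff of 3 matches "at most 2 intermediates", it can help to look at the distance of each reachable node from the start. The sketch below is an illustration only: it rebuilds the graph directly from the question's edge list (an assumption made to keep it self-contained, instead of reading from the clipboard) and uses `nx.single_source_shortest_path_length`, which returns the number of edges to each node within the cutoff.

```python
import networkx as nx

# Edge list copied from the question's dataframe (Col1, Col2)
edges = [('A', 'B'), ('C', 'A'), ('B', 'D'), ('E', 'F'),
         ('G', 'D'), ('G', 'H'), ('K', 'J')]
G = nx.Graph()
G.add_edges_from(edges)

# Number of edges from A to every node at most 3 edges away
distances = dict(nx.single_source_shortest_path_length(G, 'A', cutoff=3))

# A node at distance d has d - 1 intermediates, so d <= 3 means at most 2 intermediates
links = sorted(node for node, d in distances.items() if d > 0)
print(distances)
print(links)
```

G sits at distance 3 (intermediates B and D) and is kept, while H would be at distance 4 and is left out.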

[Figure: graph representation of the network]

2 answers

Answer 1: 2, accepted, 2021-03-24 14:49:27
Answer 2: 2, 2021-03-24 15:51:45
