Find all dependencies in a dataframe

Question

I have a dataframe:

       Parent   Child1  Child2  Child3  Child4  Child5  Child6
0         A       A1      B2      -1     -1       -1     -1
1         B       B1      -1      -1     -1       -1     -1
2         A1      -1      -1      C1     -1       -1     C2
3         D       -1      C2      -1     A1       -1     -1
4         C1      -1      -1      -1     -1       -1     -1
5         C2      -1      -1      -1     -1       -1     -1
6         B1      -1      -1      -1     -1       -1     -1
7         B2      B3      B4      -1     -1       -1     -1
8         B3      -1      -1      -1     -1       -1     -1
9         B4      -1      -1      -1     -1       -1     -1

Source:

import pandas as pd

df = pd.DataFrame({'Parent': ['A','B','A1','D','C1','C2','B1','B2','B3','B4'],'Child1': ['A1','B1','-1','-1','-1','-1','-1','B3','-1','-1'], 'Child2': ['B2','-1','-1','C2','-1','-1','-1','B4','-1','-1'] , 'Child3' : ['-1','-1','C1','-1','-1','-1','-1','-1','-1','-1'] , 'Child4' : ['-1','-1','-1','A1','-1','-1','-1','-1','-1','-1'],'Child5' : ['-1','-1','-1','-1','-1','-1','-1','-1','-1','-1'] ,'Child6' : ['-1','-1','C2','-1','-1','-1','-1','-1','-1','-1']})

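Now I have an input list of parents, e.g. parent_list = ['A', 'B'], and I need to find all children of each of them. 'A' has two children: 'A1' and 'B2'. 'A1' in turn has two children, 'C1' and 'C2', and neither of those has children (a row whose child columns are all '-1' has no children). Moving on, 'B2' has two children, 'B3' and 'B4', neither of which has children. Finally, 'B' has a single child, 'B1', which has no children.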

So the final family list for ['A', 'B'] would be ['A', 'B', 'A1', 'B2', 'C1', 'C2', 'B3', 'B4', 'B1'].

This is how far I got:

parent_list= ['A','B']
tmp_list = []
output_list = []
child_list= []

for i in parent_list:
  output_list.append(i) if i not in output_list else output_list 
  parent_list.remove(i)
  tmp_list = df.loc[df['Parent']  == i, ['Child1','Child2','Child3','Child4','Child5','Child6']].values.flatten().tolist()
  while '-1' in tmp_list: tmp_list.remove('-1')
  if  tmp_list:
    parent_list = parent_list + tmp_list

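However, my code only runs for i = 'A' in parent_list and then stops. I'm not sure why it doesn't iterate any further. When I check parent_list after the first iteration it contains what I expect, but the loop doesn't continue. What am I doing wrong?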

Also, if there is a better way to solve this, please suggest it.

Answer 1

We can melt the dataframe, build a directed graph from it with networkx, and then use nx.descendants to find all descendants of each parent in parent_list.

import networkx as nx

s = df.melt('Parent').astype(str).query("value != '-1'")
G = nx.from_pandas_edgelist(s, 'Parent', 'value', create_using=nx.DiGraph())
family = parent_list + [d for n in parent_list for d in nx.descendants(G, n)]

>>> family

['A', 'B', 'C1', 'C2', 'B3', 'B2', 'B4', 'A1', 'B1']
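
One thing to watch: nx.descendants returns a set, so the order within each parent's descendants is arbitrary, and if one input parent is itself a descendant of another, the same names will appear more than once in family. A small sketch of an order-preserving de-duplication, using a hand-written list as a stand-in for the real result (the values are illustrative, not actual output):

```python
# Illustrative stand-in for what `family` could look like with
# parent_list = ['A', 'A1']: 'A1', 'C1' and 'C2' are reachable from both inputs.
family = ['A', 'A1', 'A1', 'B2', 'C1', 'C2', 'B3', 'B4', 'C1', 'C2']

# dict keys keep insertion order (Python 3.7+), so first occurrences survive.
unique_family = list(dict.fromkeys(family))
print(unique_family)
```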

Answer 2

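The for loop only runs for 'A' because you are modifying parent_list while iterating over it. The iterator is at 'A' (index 0); you then remove 'A', so 'B' moves to index 0. At the end of the block the iterator advances to index 1, which is now past the end of the list. On top of that, parent_list = parent_list + tmp_list rebinds the name parent_list to a new list. As a result, the loop iterator is still looking at the old list, which now contains only 'B', while your variable refers to a new list containing 'B' and the children of 'A'.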

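The simplest fix in your case is to drop parent_list.remove(i), which isn't needed and is what causes the problem. You also need to change + to += or .extend() so that it updates the original list you are iterating over. That worked when I tried it.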
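
To see the difference between + and += in isolation, here is a minimal sketch with plain lists. The children dict is a hypothetical stand-in for the dataframe lookup, not part of the original code. With +, a new list is built and the name is rebound, so the running loop never sees the additions; with +=, the list the loop is already walking gets longer.

```python
# Hypothetical stand-in for the dataframe lookup: parent -> direct children.
children = {'A': ['A1', 'B2'], 'B': ['B1'], 'A1': ['C1', 'C2'], 'B2': ['B3', 'B4']}

# Rebinding: the loop keeps iterating over the original two-element list.
parents = ['A', 'B']
visited_rebind = []
for p in parents:
    visited_rebind.append(p)
    parents = parents + children.get(p, [])  # new list; the loop does not see it

# In-place: the list being iterated grows, so added children are visited too.
parents = ['A', 'B']
visited_inplace = []
for p in parents:
    visited_inplace.append(p)
    parents += children.get(p, [])  # same list object is extended

print(visited_rebind)
print(visited_inplace)
```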

Personally, I think a better solution is a recursive function that collects the children. Something like:

def get_children(parent):
    child_cols = ['Child1','Child2','Child3','Child4','Child5','Child6']
    # direct children of this parent; '-1' marks an empty slot
    child_list = df.loc[df['Parent'] == parent, child_cols].values.flatten().tolist()
    child_list = [c for c in child_list if c != '-1']
    # iterate over a copy so that extending child_list does not affect the loop
    for i in list(child_list):
        child_list.extend(get_children(i))
    # cast to set and back to remove duplicate children
    return list(set(child_list))

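I tested this and it does more or less what you want for a single parent. If you want the whole family, call the function for each entry in parent_list and combine parent_list with the returned lists.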
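
A rough sketch of that last step, written a little differently from the function above: instead of returning sets and merging them afterwards, it threads one shared seen list through the recursion so each name is added once, in depth-first order. To keep it self-contained it uses a hypothetical children_of dict in place of the dataframe lookup; collect and get_family are illustrative names.

```python
# Hypothetical stand-in for the dataframe: parent -> direct children
# (the '-1' placeholders are assumed to be dropped already).
children_of = {
    'A': ['A1', 'B2'], 'B': ['B1'], 'A1': ['C1', 'C2'],
    'D': ['C2', 'A1'], 'B2': ['B3', 'B4'],
}

def collect(parent, seen):
    """Append parent and its descendants to seen, depth first, skipping repeats."""
    if parent in seen:
        return
    seen.append(parent)
    for child in children_of.get(parent, []):
        collect(child, seen)

def get_family(parent_list):
    """Return every input parent plus all of its descendants, without duplicates."""
    seen = []
    for parent in parent_list:
        collect(parent, seen)
    return seen

print(get_family(['A', 'B']))
```

The result contains the same nine names as the expected family in the question, though in depth-first rather than the question's order.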

Answer 3

You can use:

import copy
import numpy as np
df.set_index('Parent', inplace=True)
df.replace('-1', np.nan, inplace=True)
parent_list = ['A','B']
family = copy.deepcopy(parent_list)
for p in family:
    family.extend(df.loc[p].dropna().to_list())

Output:

['A', 'B', 'A1', 'B2', 'B1', 'C1', 'C2', 'B3', 'B4']
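
This loop relies on family growing while it is being iterated, which works for the sample data but has two caveats: a child reachable along two paths is added twice (for example 'A1' and 'C2' if both 'A' and 'D' are in the input), and a cycle in the data would keep the loop running forever. A sketch of the same breadth-first idea with a seen set guarding against both, again on a hypothetical dict rather than the dataframe (the 'X'/'Y' cycle is made up for illustration):

```python
from collections import deque

# Hypothetical adjacency; 'X' <-> 'Y' is an invented cycle, not in the original data.
children_of = {
    'A': ['A1', 'B2'], 'A1': ['C1', 'C2'], 'D': ['C2', 'A1'],
    'B2': ['B3', 'B4'], 'X': ['Y'], 'Y': ['X'],
}

def family_bfs(parent_list):
    """Breadth-first walk that visits each name at most once."""
    seen = set()
    family = []
    queue = deque(parent_list)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already handled via another path (or a cycle)
        seen.add(node)
        family.append(node)
        queue.extend(children_of.get(node, []))
    return family

print(family_bfs(['A', 'D']))
print(family_bfs(['X']))
```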

3 answers

Answer 1
3 2021-10-16 05:46:29

Answer 2
1 2021-10-16 05:45:33

Answer 3
1 2021-10-16 06:21:55
