How to create hierarchical path by id column in a dataframe in python?

I have a dataframe with the columns parent_id, parent_name, id, name and last_category. df looks like this:

parent_id   parent_name id      name    last_category
NaN         NaN         1       b       0
1           b           11      b1      0
11          b1          111     b2      0
111         b2          1111    b3      0
1111        b3          11111   b4      1
NaN         NaN         2       a       0
2           a           22      a1      0
22          a1          222     a2      0
222         a2          2222    a3      1
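
For anyone who wants to reproduce the problem, the table above can be rebuilt roughly like this. This is only a sketch: the dtypes are an assumption (the question does not state them), and `parent_id` ends up as float64 because it contains NaN.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
# parent_id becomes float64 because of the NaN root rows (assumed dtype).
df = pd.DataFrame({
    "parent_id": [np.nan, 1, 11, 111, 1111, np.nan, 2, 22, 222],
    "parent_name": [np.nan, "b", "b1", "b2", "b3", np.nan, "a", "a1", "a2"],
    "id": [1, 11, 111, 1111, 11111, 2, 22, 222, 2222],
    "name": ["b", "b1", "b2", "b3", "b4", "a", "a1", "a2", "a3"],
    "last_category": [0, 0, 0, 0, 1, 0, 0, 0, 1],
})

print(df)
```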

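I want to create the hierarchical path for every row of df where last_category is 1, from the root category down to the last one. So the new dataframe I create should look like this (df_last):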

name_path                id_path
b / b1 / b2 / b3 / b4    1 / 11 / 111 / 1111 / 11111
a / a1 / a2 / a3         2 / 22 / 222 / 2222

How can I do this?

A solution using only numpy and pandas:

import numpy as np
import pandas as pd

# It's easier if we index the dataframe with the `id`
# (this assumes the IDs are unique)
df = df.set_index("id")

# `parents[i]` returns the parent ID of `i`
parents = df["parent_id"].to_dict()

paths = {}

# Find all nodes with last_category == 1
for child_id in df.query("last_category == 1").index:
    current = child_id
    path = [current]

    # Iteratively travel up the hierarchy until the parent is NaN
    while True:
        pid = parents[current]
        if np.isnan(pid):
            break
        path.append(pid)
        current = pid

    # The path to the child node is the reverse of
    # the path we traveled
    paths[int(child_id)] = np.array(path[::-1], dtype="int")

And build the result dataframe:

result = pd.DataFrame({
    id_: (
        " / ".join(df.loc[pids, "name"]),
        " / ".join(pids.astype("str"))
    )
    for id_, pids in paths.items()
}, index=["name_path", "id_path"]).T
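
The same parent-walking idea can also be wrapped in one function, which may be easier to reuse. The following is a minimal sketch, not the answer's original code: `build_paths`, its `sep` parameter and the small `toy` frame are hypothetical names and data made up for illustration. It also stops if a parent ID is missing or a cycle is found, which the loop above does not guard against.

```python
import pandas as pd


def build_paths(df, sep=" / "):
    """Return name_path and id_path for every row with last_category == 1.

    Hypothetical helper: assumes the columns id, parent_id, name and
    last_category exist and that id is unique.
    """
    indexed = df.set_index("id")
    parents = indexed["parent_id"].to_dict()
    names = indexed["name"].to_dict()

    rows = []
    for leaf in indexed.index[indexed["last_category"] == 1]:
        path, seen = [], set()
        node = leaf
        # Walk up until the parent is NaN, unknown, or already visited.
        while pd.notna(node) and node in parents and node not in seen:
            node = int(node)
            seen.add(node)
            path.append(node)
            node = parents[node]
        path.reverse()  # we collected leaf -> root, we want root -> leaf
        rows.append({
            "name_path": sep.join(str(names[n]) for n in path),
            "id_path": sep.join(str(n) for n in path),
        })
    return pd.DataFrame(rows, columns=["name_path", "id_path"])


# Made-up toy data: one root (1) with two branches ending in ids 2 and 4.
nan = float("nan")
toy = pd.DataFrame({
    "parent_id": [nan, 1, 1, 3],
    "id": [1, 2, 3, 4],
    "name": ["x", "x1", "x2", "x3"],
    "last_category": [0, 1, 0, 1],
})
toy_last = build_paths(toy)
print(toy_last)
```

Calling `build_paths(df)` on the dataframe from the question, before it is re-indexed with `set_index("id")`, should give one row per `last_category == 1` node in the same way.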

You can use networkx to resolve the paths between the root nodes and the leaf nodes, with the all_simple_paths function.

# Python env: pip install networkx
# Anaconda env: conda install networkx
import networkx as nx
import pandas as pd

# Create network from your dataframe
G = nx.from_pandas_edgelist(df, source='parent_id', target='id',
                            create_using=nx.DiGraph)
nx.set_node_attributes(G, df.set_index('id')[['name']].to_dict('index'))

# Find roots of your graph (a root is a node with no input)
roots = [node for node, degree in G.in_degree() if degree == 0]

# Find leaves of your graph (a leaf is a node with no output)
leaves = [node for node, degree in G.out_degree() if degree == 0]

# Find all paths
paths = []
for root in roots:
    for leaf in leaves:
        for path in nx.all_simple_paths(G, root, leaf):
            # [1:] to remove NaN parent_id
            paths.append({'id_path': ' / '.join(str(n) for n in path[1:]),
                          'name_path': ' / '.join(G.nodes[n]['name'] for n in path[1:])})

out = pd.DataFrame(paths)

Output:

>>> out
                       id_path              name_path
0  1 / 11 / 111 / 1111 / 11111  b / b1 / b2 / b3 / b4
1          2 / 22 / 222 / 2222       a / a1 / a2 / a3
