How can I transform a columnar hierarchy into a parent-child list in Pandas?

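I am trying to use the Pandas library to convert a hierarchy stored in a columnar format, with a fixed number of columns (many of them empty), into an adjacency list of child and parent nodes.

Example hierarchy

Here is a made-up example with 5 levels: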

                         Books
                     /     |     \
             Science     (null)      (null)
               /           |           \
      Astronomy          (null)          Pictures       
         /  \              |                      \
Astrophysics Cosmology   (null)                    Astronomy
      /         \          |                       /    |    \
  (null)        (null)   Amateurs_Astronomy   Galaxies Stars Astronauts

data.csv

id,level_1,level_2,level_3,level_4,level_5
1,Books,Science,Astronomy,Astrophysics,
2,Books,Science,Astronomy,Cosmology,
3,Books,,,,Amateurs_Astronomy
4,Books,,Pictures,Astronomy,Galaxies
5,Books,,Pictures,Astronomy,Stars
6,Books,,Pictures,Astronomy,Astronauts

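What I did

I started by adding, for each level, a column that stores a uuid for every existing node.

[Edit, following mozway's comment]

The problem with this code is that it assigns different uuids to the same node:

  • Rows 1 and 2 share the same levels 1, 2 and 3, so they should have the same uuid in pk_level_3.
  • Likewise, rows 4, 5 and 6 should have the same uuid in pk_level_3 and in pk_level_4.

import uuid
import pandas as pd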

df = pd.read_csv('data.csv')

# iterate over each column in the dataframe to add a new column,
# containing a uuid each time the csv row has a value for this level:
for col in df.columns:
    if df[col].isnull().sum() > 0:
        new_col = 'pk_' + col
        df[new_col] = None
        # fill the new column with uuid only for non-null values of the original column
        df.loc[df[col].notnull(), new_col] = df.loc[df[col].notnull(), col].apply(lambda x: uuid.uuid4())
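
The loop above gives the same node different identifiers because `uuid.uuid4()` is random on every call. One possible alternative, sketched here under the assumption that a node is identified by the slash-joined path of its non-null ancestors (a convention introduced only for this illustration), is `uuid.uuid5`, which is deterministic for a given namespace and name:

```python
import uuid

# Hypothetical convention: identify a node by the path of its non-null ancestors.
path = "Books/Science/Astronomy"

# uuid4() is random: two calls for the same node give two different ids.
random_a = uuid.uuid4()
random_b = uuid.uuid4()

# uuid5() hashes (namespace, name): the same path always gives the same id.
stable_a = uuid.uuid5(uuid.NAMESPACE_URL, path)
stable_b = uuid.uuid5(uuid.NAMESPACE_URL, path)

print(random_a == random_b)  # False
print(stable_a == stable_b)  # True
```

With a deterministic id, rows 1 and 2 would get the same value for their shared `Astronomy` node no matter how many times it is computed.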

Also, I don't know how to find the parent of each node while skipping all the null nodes.

Any ideas on how to get the following result?

this_node,parent_node,this_node_uuid,parent_node_uuid
Science,Books,books/science-node-uuid,books-node-uuid
Astronomy,Science,books/science/astronomy-node-uuid,books/science-node-uuid
Astrophysics,Astronomy,books/science/astronomy/astrophysics-node-uuid,books/science/astronomy-node-uuid
Amateurs_Astronomy,Books,books/amateurs_astronomy-node-uuid,books-node-uuid

(…)
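
To make the "skip the null nodes" step concrete, here is a plain-Python sketch that does not rely on pandas. For each row it keeps only the non-empty levels, then pairs every node with the one before it. The helper names `node_uuid` and `adjacency_list`, and the choice of keying identifiers on the full path with `uuid.uuid5`, are assumptions of this sketch rather than anything stated in the question.

```python
import csv
import io
import uuid

CSV_TEXT = """id,level_1,level_2,level_3,level_4,level_5
1,Books,Science,Astronomy,Astrophysics,
2,Books,Science,Astronomy,Cosmology,
3,Books,,,,Amateurs_Astronomy
4,Books,,Pictures,Astronomy,Galaxies
5,Books,,Pictures,Astronomy,Stars
6,Books,,Pictures,Astronomy,Astronauts
"""


def node_uuid(path):
    """Deterministic id for a node, keyed on its full path (assumption of this sketch)."""
    return uuid.uuid5(uuid.NAMESPACE_URL, "/".join(path))


def adjacency_list(csv_text):
    seen = set()
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        # Keep only the non-empty levels, in column order: this skips null nodes.
        path = [value for key, value in record.items() if key != "id" and value]
        for depth in range(1, len(path)):
            node_path = tuple(path[: depth + 1])
            if node_path in seen:  # node already emitted by an earlier row
                continue
            seen.add(node_path)
            rows.append((path[depth], path[depth - 1],
                         node_uuid(node_path), node_uuid(node_path[:-1])))
    return rows


edges = adjacency_list(CSV_TEXT)
print("this_node,parent_node,this_node_uuid,parent_node_uuid")
for this_node, parent_node, this_uuid, parent_uuid in edges:
    print(this_node, parent_node, this_uuid, parent_uuid, sep=",")
```

Because the identifier is derived from the whole path, `Astronomy` under `Science` and `Astronomy` under `Pictures` end up with different ids, while the `Astronomy` shared by rows 1 and 2 is emitted only once.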

Here is one way to generate a uuid for each value and level, and then the adjacency list:

import uuid
from collections import defaultdict

import pandas as pd

# `id` must be the index so that it survives stack() as a regular column
df = pd.read_csv('data.csv', index_col='id')

# one uuid per group number, created on first lookup and reused afterwards
mapper = defaultdict(uuid.uuid4)

# long format, one row per non-empty cell; here 'level_1' is the name that
# reset_index() gives to the stacked level holding the original column names
df2 = (df.stack().dropna().reset_index(name='node')
         .assign(uuid=lambda d: d.groupby(['level_1', 'node']).ngroup().map(mapper))
      )

# shift(1) within each id pairs every node with the previous non-null level, its parent
out = (df2[['node', 'uuid']]
 .join(df2.groupby('id')[['node', 'uuid']].shift(1).add_prefix('parent_'))
 .dropna()
 [['node', 'parent_node', 'uuid', 'parent_uuid']]
)

Output:

                  node  parent_node                                  uuid                           parent_uuid
1              Science        Books  d5eabe29-9822-4cd5-832f-e7a69630ed1a  73299f14-db0b-49ac-8050-13ba909fbbf9
2            Astronomy      Science  f72718d8-99d0-4160-ab2b-c4d990c103bc  d5eabe29-9822-4cd5-832f-e7a69630ed1a
3         Astrophysics    Astronomy  03f6af50-df0f-4762-8791-3c06103dae62  f72718d8-99d0-4160-ab2b-c4d990c103bc
5              Science        Books  d5eabe29-9822-4cd5-832f-e7a69630ed1a  73299f14-db0b-49ac-8050-13ba909fbbf9
6            Astronomy      Science  f72718d8-99d0-4160-ab2b-c4d990c103bc  d5eabe29-9822-4cd5-832f-e7a69630ed1a
7            Cosmology    Astronomy  27de8aa5-5805-41f0-b127-e1c962328398  f72718d8-99d0-4160-ab2b-c4d990c103bc
9   Amateurs_Astronomy        Books  af5763c3-9f55-4815-88c8-3996bd2407db  73299f14-db0b-49ac-8050-13ba909fbbf9
11            Pictures        Books  7cbc093c-b34c-4d45-8e38-24cc68b6ccc5  73299f14-db0b-49ac-8050-13ba909fbbf9
12           Astronomy     Pictures  41bf967b-d6ca-4da7-b5ad-3ec05ceefd43  7cbc093c-b34c-4d45-8e38-24cc68b6ccc5
13            Galaxies    Astronomy  68a8cb4f-def5-492d-b497-318a074a1f15  41bf967b-d6ca-4da7-b5ad-3ec05ceefd43
15            Pictures        Books  7cbc093c-b34c-4d45-8e38-24cc68b6ccc5  73299f14-db0b-49ac-8050-13ba909fbbf9
16           Astronomy     Pictures  41bf967b-d6ca-4da7-b5ad-3ec05ceefd43  7cbc093c-b34c-4d45-8e38-24cc68b6ccc5
17               Stars    Astronomy  9d823bdd-fd3e-43a3-8756-51160490c8ed  41bf967b-d6ca-4da7-b5ad-3ec05ceefd43
19            Pictures        Books  7cbc093c-b34c-4d45-8e38-24cc68b6ccc5  73299f14-db0b-49ac-8050-13ba909fbbf9
20           Astronomy     Pictures  41bf967b-d6ca-4da7-b5ad-3ec05ceefd43  7cbc093c-b34c-4d45-8e38-24cc68b6ccc5
21          Astronauts    Astronomy  609e708f-60cd-4928-863c-d41255330981  41bf967b-d6ca-4da7-b5ad-3ec05ceefd43
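
Grouping by level and node name yields one identifier per (level, name) pair. That is enough for the sample data, where the two `Astronomy` nodes sit at different levels, but two branches holding the same name at the same level would share an identifier. The sketch below illustrates the difference with a hypothetical pair of paths (not present in data.csv) and the same `defaultdict(uuid.uuid4)` caching idea. Keying on the full path is an assumption of this example, not part of the answer above.

```python
import uuid
from collections import defaultdict

# Hypothetical paths: "Astronomy" appears at level 3 under two different parents.
paths = [
    ("Books", "Science", "Astronomy"),
    ("Books", "Pictures", "Astronomy"),
]

# defaultdict calls uuid.uuid4() once per new key and then reuses the stored value.
by_level_and_name = defaultdict(uuid.uuid4)
by_full_path = defaultdict(uuid.uuid4)

level_ids = [by_level_and_name[(len(p), p[-1])] for p in paths]
path_ids = [by_full_path[p] for p in paths]

print(level_ids[0] == level_ids[1])  # True: both nodes collapse into one id
print(path_ids[0] == path_ids[1])    # False: each branch keeps its own id
```

If such collisions can occur in the real data, a path-based key is the safer choice; otherwise the (level, name) key is simpler.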

Graph

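import networkx as nx

# `out` is the adjacency list; each edge points from a node to its parent
G = nx.from_pandas_edgelist(out, source='uuid', target='parent_uuid', create_using=nx.DiGraph)
# label every uuid with its node name, taken from the long-format frame df2
nx.set_node_attributes(G, dict(zip(df2['uuid'], df2['node'])), name='label')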
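
As a quick sanity check that does not require networkx, a child-to-parent mapping can be followed upward until the root is reached. This is a plain-Python sketch: the `parent_of` dictionary and the `path_to_root` helper are hypothetical names, path strings stand in for the uuids to keep the output readable, and the mapping is assumed to contain no cycles.

```python
# Hypothetical child -> parent mapping, keyed by path strings instead of uuids.
parent_of = {
    "Books/Science": "Books",
    "Books/Science/Astronomy": "Books/Science",
    "Books/Science/Astronomy/Cosmology": "Books/Science/Astronomy",
    "Books/Pictures": "Books",
    "Books/Pictures/Astronomy": "Books/Pictures",
    "Books/Pictures/Astronomy/Stars": "Books/Pictures/Astronomy",
}


def path_to_root(node, parent_of):
    """Return the chain node -> ... -> root by following parent links."""
    chain = [node]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


print(path_to_root("Books/Pictures/Astronomy/Stars", parent_of))
# ['Books/Pictures/Astronomy/Stars', 'Books/Pictures/Astronomy', 'Books/Pictures', 'Books']
```

Every node should reach `Books`; a chain that stops elsewhere points to a missing edge.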


The code below builds the node/parent pairs. From there, how do you want to generate your uuids?

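import pandas as pd

df = pd.read_csv('data.csv')

def build_hierarchy(s):
    # s holds the non-null levels of one id, in column order;
    # shift(-1) lines each level up with the next deeper one (its child)
    return pd.concat([s.shift(-1), s], keys=['node', 'parent'], axis=1)

out = (df.set_index('id').stack()
         .groupby(level='id', group_keys=False).apply(build_hierarchy)
         .droplevel(1).reset_index())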

Output:

>>> out
    id                node              parent
0    1             Science               Books
1    1           Astronomy             Science
2    1        Astrophysics           Astronomy
3    1                None        Astrophysics
4    2             Science               Books
5    2           Astronomy             Science
6    2           Cosmology           Astronomy
7    2                None           Cosmology
8    3  Amateurs_Astronomy               Books
9    3                None  Amateurs_Astronomy
10   4            Pictures               Books
11   4           Astronomy            Pictures
12   4            Galaxies           Astronomy
13   4                None            Galaxies
14   5            Pictures               Books
15   5           Astronomy            Pictures
16   5               Stars           Astronomy
17   5                None               Stars
18   6            Pictures               Books
19   6           Astronomy            Pictures
20   6          Astronauts           Astronomy
21   6                None          Astronauts

Another approach pairs the consecutive non-null values of each row with a rolling window of size 2:

df1 = pd.read_csv('data.csv')

def function1(ss: pd.Series):
    # keep only full windows of two values, as a hashable (parent, node) tuple
    return tuple(ss) if ss.size > 1 else None

# for each row: drop the empty levels, slide a window of size 2, collect the pairs
df11 = (df1.set_index('id')
           .apply(lambda ss: pd.Series(ss.dropna().rolling(2, 2))
                               .apply(function1).dropna().tolist(), axis=1)
           .explode().drop_duplicates())
df12 = pd.DataFrame(df11.tolist(), columns=['parent_node', 'node'])
# id() is the identity of the Python string object, a stand-in for a real uuid
(df12.assign(uuid=df12.node.map(id), parent_uuid=df12.parent_node.map(id))
     [['node', 'parent_node', 'uuid', 'parent_uuid']])

out:

                 node  parent_node           uuid    parent_uuid
0             Science        Books  2437636912432  2437636899760
1           Astronomy      Science  2437636913072  2437636912432
2        Astrophysics    Astronomy  2437636914288  2437636913072
3           Cosmology    Astronomy  2437636909360  2437636913072
4  Amateurs_Astronomy        Books  2437649183760  2437636899760
5            Pictures        Books  2437649161840  2437636899760
6           Astronomy     Pictures  2437649163120  2437649161840
7            Galaxies    Astronomy  2437649165552  2437649163120
8               Stars    Astronomy  2437649167344  2437649163120
9          Astronauts    Astronomy  2437649162864  2437649163120

