How can I transform columnar hierarchy into parent child list in Pandas?
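I am trying to use the Pandas library to convert a hierarchy stored in a columnar format with a fixed number of columns (many of them empty) into an adjacency list of child and parent nodes.
Here is a fictional example with 5 levels: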
Books
/ | \
Science (null) (null)
/ | \
Astronomy (null) Pictures
/ \ | \
Astrophysics Cosmology (null) Astronomy
/ \ | / | \
(null) (null) Amateurs_Astronomy Galaxies Stars Astronauts
id,level_1,level_2,level_3,level_4,level_5
1,Books,Science,Astronomy,Astrophysics,
2,Books,Science,Astronomy,Cosmology,
3,Books,,,,Amateurs_Astronomy
4,Books,,Pictures,Astronomy,Galaxies
5,Books,,Pictures,Astronomy,Stars
6,Books,,Pictures,Astronomy,Astronauts
I start by adding a column that stores a uuid for each existing node.
[Edit, following mozway's comment]
The problem with this code is that it assigns different uuids to the same node:
import uuid
import pandas as pd

df = pd.read_csv('data.csv')
# iterate over each column in the dataframe to add a new column,
# containing a uuid each time the csv row has a value for this level:
for col in df.columns:
    if df[col].isnull().sum() > 0:
        new_col = 'pk_' + col
        df[new_col] = None
        # fill the new column with uuid only for non-null values of the original column
        # (uuid4 is random, so every row gets a different value even for the same node)
        df.loc[df[col].notnull(), new_col] = df.loc[df[col].notnull(), col].apply(lambda x: uuid.uuid4())
Also, I don't know how to find the parent of each node while skipping all the null nodes.
Any ideas on how to get the following result?
this_node,parent_node,this_node_uuid,parent_node_uuid
Science,Books,books/science-node-uuid,books-node-uuid
Astronomy,Science,books/science/astronomy-node-uuid,books/science-node-uuid
Astrophysics,Astronomy,books/science/astronomy/astrophysics-node-uuid,books/science/astronomy-node-uuid
Amateurs_Astronomy,Books,books/amateurs_astronomy-node-uuid,books-node-uuid
(…)
Here is one way to generate a uuid per value and level, followed by the adjacency list:
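import uuid
from collections import defaultdict

# every new key seen by the mapper gets a fresh random uuid
mapper = defaultdict(uuid.uuid4)

# long format, one row per non-null (id, level) pair; after reset_index the
# stacked column labels land in a column that pandas names 'level_1'
df2 = (df.set_index('id').stack().reset_index(name='node')
         .assign(uuid=lambda d: d.groupby(['level_1', 'node']).ngroup().map(mapper))
      )

# pair each row with the next non-null level of the same id
# (shift(-1) fetches the deeper level, so the 'parent_' columns hold the child of 'node')
out = (df2[['node', 'uuid']]
       .join(df2.groupby('id')[['node', 'uuid']].shift(-1).add_prefix('parent_'))
       .dropna()
       [['node', 'parent_node', 'uuid', 'parent_uuid']]
      )
print(out)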
Output:
node parent_node uuid parent_uuid
0 Books Science 73299f14-db0b-49ac-8050-13ba909fbbf9 d5eabe29-9822-4cd5-832f-e7a69630ed1a
1 Science Astronomy d5eabe29-9822-4cd5-832f-e7a69630ed1a f72718d8-99d0-4160-ab2b-c4d990c103bc
2 Astronomy Astrophysics f72718d8-99d0-4160-ab2b-c4d990c103bc 03f6af50-df0f-4762-8791-3c06103dae62
4 Books Science 73299f14-db0b-49ac-8050-13ba909fbbf9 d5eabe29-9822-4cd5-832f-e7a69630ed1a
5 Science Astronomy d5eabe29-9822-4cd5-832f-e7a69630ed1a f72718d8-99d0-4160-ab2b-c4d990c103bc
6 Astronomy Cosmology f72718d8-99d0-4160-ab2b-c4d990c103bc 27de8aa5-5805-41f0-b127-e1c962328398
8 Books Amateurs_Astronomy 73299f14-db0b-49ac-8050-13ba909fbbf9 af5763c3-9f55-4815-88c8-3996bd2407db
10 Books Pictures 73299f14-db0b-49ac-8050-13ba909fbbf9 7cbc093c-b34c-4d45-8e38-24cc68b6ccc5
11 Pictures Astronomy 7cbc093c-b34c-4d45-8e38-24cc68b6ccc5 41bf967b-d6ca-4da7-b5ad-3ec05ceefd43
12 Astronomy Galaxies 41bf967b-d6ca-4da7-b5ad-3ec05ceefd43 68a8cb4f-def5-492d-b497-318a074a1f15
14 Books Pictures 73299f14-db0b-49ac-8050-13ba909fbbf9 7cbc093c-b34c-4d45-8e38-24cc68b6ccc5
15 Pictures Astronomy 7cbc093c-b34c-4d45-8e38-24cc68b6ccc5 41bf967b-d6ca-4da7-b5ad-3ec05ceefd43
16 Astronomy Stars 41bf967b-d6ca-4da7-b5ad-3ec05ceefd43 9d823bdd-fd3e-43a3-8756-51160490c8ed
18 Books Pictures 73299f14-db0b-49ac-8050-13ba909fbbf9 7cbc093c-b34c-4d45-8e38-24cc68b6ccc5
19 Pictures Astronomy 7cbc093c-b34c-4d45-8e38-24cc68b6ccc5 41bf967b-d6ca-4da7-b5ad-3ec05ceefd43
20 Astronomy Astronauts 41bf967b-d6ca-4da7-b5ad-3ec05ceefd43 609e708f-60cd-4928-863c-d41255330981
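import networkx as nx

# optional: load the adjacency list built above (referred to as `out`) into a directed graph
G = nx.from_pandas_edgelist(out, source='uuid', target='parent_uuid', create_using=nx.DiGraph)
# label every uuid with its node name (df2 holds one row per node occurrence)
nx.set_node_attributes(G, dict(zip(df2['uuid'], df2['node'])), name='label')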
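As an alternative sketch for the result requested in the question, the identifiers can be derived from each node's path instead of being drawn at random, so the same node always gets the same uuid and null levels are skipped when looking up the parent. The helper names `path_uuid` and `adjacency_from_paths`, the lower-cased `/`-joined path, and the choice of `uuid.NAMESPACE_URL` are assumptions made for illustration; they are not part of the answers on this page.

```python
import io
import uuid
import pandas as pd

# Sample data from the question, inlined so the sketch runs on its own
CSV = """id,level_1,level_2,level_3,level_4,level_5
1,Books,Science,Astronomy,Astrophysics,
2,Books,Science,Astronomy,Cosmology,
3,Books,,,,Amateurs_Astronomy
4,Books,,Pictures,Astronomy,Galaxies
5,Books,,Pictures,Astronomy,Stars
6,Books,,Pictures,Astronomy,Astronauts
"""

def path_uuid(path):
    # uuid5 is deterministic: the same path string always yields the same UUID.
    # The namespace is an arbitrary choice for this sketch.
    return uuid.uuid5(uuid.NAMESPACE_URL, path)

def adjacency_from_paths(df):
    rows = []
    for _, row in df.set_index('id').iterrows():
        names = [str(v) for v in row.dropna()]  # skip the null levels
        for depth in range(1, len(names)):
            parent_path = '/'.join(names[:depth]).lower()
            node_path = '/'.join(names[:depth + 1]).lower()
            rows.append({
                'this_node': names[depth],
                'parent_node': names[depth - 1],
                'this_node_uuid': path_uuid(node_path),
                'parent_node_uuid': path_uuid(parent_path),
            })
    # edges shared by several ids (e.g. Books -> Science) are kept once
    return pd.DataFrame(rows).drop_duplicates(ignore_index=True)

df = pd.read_csv(io.StringIO(CSV))
edges = adjacency_from_paths(df)
print(edges[['this_node', 'parent_node']])
```

Because the uuid depends on the whole path, the Astronomy node under Science and the one under Pictures receive different identifiers, while repeated rows for the same path collapse into a single edge.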
From here, how do you generate your uuid?
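def build_hierarchy(s):
    # s: the non-null levels of one id, ordered from root to leaf
    # pair each value ('parent') with the next level down ('node')
    return pd.concat([s.shift(-1), s], keys=['node', 'parent'], axis=1)

out = (df.set_index('id').stack()
         .groupby(level='id', group_keys=False).apply(build_hierarchy)
         .droplevel(1).reset_index())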
Output:
>>> out
id node parent
0 1 Science Books
1 1 Astronomy Science
2 1 Astrophysics Astronomy
3 1 None Astrophysics
4 2 Science Books
5 2 Astronomy Science
6 2 Cosmology Astronomy
7 2 None Cosmology
8 3 Amateurs_Astronomy Books
9 3 None Amateurs_Astronomy
10 4 Pictures Books
11 4 Astronomy Pictures
12 4 Galaxies Astronomy
13 4 None Galaxies
14 5 Pictures Books
15 5 Astronomy Pictures
16 5 Stars Astronomy
17 5 None Stars
18 6 Pictures Books
19 6 Astronomy Pictures
20 6 Astronauts Astronomy
21 6 None Astronauts
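The table above still has one row per leaf where node is None, and it repeats the edges shared by several ids. A minimal sketch of a cleanup step follows; it uses the first eight rows of that output typed in by hand, and the name `edges` is introduced here only for illustration.

```python
import pandas as pd

# First rows of the `out` table shown above, typed in by hand for illustration
out = pd.DataFrame({
    'id':     [1, 1, 1, 1, 2, 2, 2, 2],
    'node':   ['Science', 'Astronomy', 'Astrophysics', None,
               'Science', 'Astronomy', 'Cosmology', None],
    'parent': ['Books', 'Science', 'Astronomy', 'Astrophysics',
               'Books', 'Science', 'Astronomy', 'Cosmology'],
})

edges = (out.dropna(subset=['node'])   # leaf rows have no child, so node is missing
            .drop(columns='id')        # the row id is not part of the edge
            .drop_duplicates(ignore_index=True))
print(edges)
```

Deduplicating on the (node, parent) names alone treats two nodes with the same name and the same parent name as one edge, so it is only safe when such pairs really denote the same node.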
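import pandas as pd

def function1(ss: pd.Series):
    # keep only complete windows of two consecutive levels
    return ss.tolist() if ss.size > 1 else None

# df1 is assumed to be the DataFrame read from the CSV above
df11 = (df1.set_index('id')
           .apply(lambda ss: pd.Series(ss.dropna().rolling(2, 2))
                  .apply(function1).dropna().tolist(), axis=1)
           .explode().drop_duplicates())
df12 = pd.DataFrame(df11.tolist(), columns=['node', 'parent_node'])
# note: `id` here is Python's built-in id(), i.e. the object's identity, not a real UUID
df12.assign(uuid=df12.node.map(id)).assign(parent_uuid=df12.parent_node.map(id))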
out:
node parent_node uuid parent_uuid
0 Books Science 2437636899760 2437636912432
1 Science Astronomy 2437636912432 2437636913072
2 Astronomy Astrophysics 2437636913072 2437636914288
3 Astronomy Cosmology 2437636913072 2437636909360
4 Books Amateurs_Astronomy 2437636899760 2437649183760
5 Books Pictures 2437636899760 2437649161840
6 Pictures Astronomy 2437649161840 2437649163120
7 Astronomy Galaxies 2437649163120 2437649165552
8 Astronomy Stars 2437649163120 2437649167344
9 Astronomy Astronauts 2437649163120 2437649162864