Most efficient way to create a tree-like structure from a large JSON file

Question

I have a large JSON file that looks like this:

[(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))]

Here the key of each element is that element's unique index, and the first element of the value pair is the index of its parent.

The end goal is to convert this JSON file into the following:

[(3, (2, 'Child Parent Root')), (2, (1, 'Parent Root')), (1, (None, 'Root'))]

Here the second element of each item's value pair is modified so that it holds the concatenation of all values up to its root ancestor.
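To make the intended transformation concrete, here is a minimal sketch that builds the concatenated value for a single item by following parent indices up to the root. The names `by_id` and `path_to_root` are illustrative and not part of the original data; the sketch assumes every parent index is present in the list and that there are no cycles.

```python
# Sample data from the question: (index, (parent_index, value)).
items = [(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))]

# Index the items by their unique index for constant-time parent lookups.
by_id = dict(items)

def path_to_root(index):
    """Join the values from `index` up to its root ancestor (hypothetical helper)."""
    parts = []
    while index is not None:
        parent, value = by_id[index]
        parts.append(value)
        index = parent
    return ' '.join(parts)

print(path_to_root(3))  # Child Parent Root
```

Running this independently for every item would revisit shared ancestors again and again, which is why a more efficient approach is being asked for.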

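The number of levels is not fixed and can go up to 256. I know I can solve this by building a tree data structure and traversing it, but the problem is that the JSON file is huge (almost 180 million items in the list).

Any ideas on how to achieve this efficiently? Suggestions involving Apache Spark are also fine.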

Answer 1

You can use a breadth-first search to find the chain of ancestors for every element:

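from collections import deque, defaultdict

d = [(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))]

# Map each parent index to its (index, value) children; the root is stored under None.
d1 = defaultdict(list)
for a, (b, c) in d:
    d1[b].append((a, c))

# Start at the root (this assumes a single root with index 1) and walk down level by level.
q, r = deque([(1, d1[None][0][1])]), {}
while q:
    n = q.popleft()
    r[n[0]] = n[1]
    # A child's value is its own name followed by the parent's accumulated value.
    q.extend([(a, b + ' ' + n[1]) for a, b in d1[n[0]]])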

Now r stores the concatenated ancestor values for each element:

{1: 'Root', 2: 'Parent Root', 3: 'Child Parent Root'}

Then use a list comprehension to update d:

result = [(a, (b, r[a])) for a, (b, _) in d]

Output:

[(3, (2, 'Child Parent Root')), (2, (1, 'Parent Root')), (1, (None, 'Root'))]

An iterative approach such as BFS avoids the RecursionError that a recursive DFS can raise on very large graphs.
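The same idea can be wrapped in a function that does not hard-code the root index and also copes with several roots. This is a sketch under assumptions rather than the answer's original code: the name `concat_ancestors` is made up here, and it assumes the input fits in memory, contains no cycles, marks roots with a parent of `None`, and that every item is reachable from some root.

```python
from collections import defaultdict, deque

def concat_ancestors(items):
    """Return items with each value replaced by its value-to-root concatenation."""
    # Group (index, value) pairs under their parent index.
    children = defaultdict(list)
    for index, (parent, value) in items:
        children[parent].append((index, value))

    # Start from every root (parent is None) and walk down level by level.
    paths = {}
    queue = deque(children.get(None, ()))
    while queue:
        index, path = queue.popleft()
        paths[index] = path
        # A child's path is its own value followed by the parent's path.
        queue.extend((child, value + ' ' + path)
                     for child, value in children.get(index, ()))

    # Rebuild the list in its original order with the concatenated values.
    return [(index, (parent, paths[index])) for index, (parent, _) in items]

data = [(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))]
print(concat_ancestors(data))
```

For the sample data this prints the same list as the output shown above. Each item is enqueued and dequeued once, so the work grows linearly with the number of items.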

Answer 2

In Spark, this is a graph problem and is solved with vertex-centric programming. Unfortunately, GraphX has no Python-compatible API.

Another option is to use graphframes.
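For readers unfamiliar with the term, the vertex-centric idea can be sketched in plain Python without Spark: in each superstep, every vertex whose value is already final sends it to its children, and each receiver appends it to its own value. The variable names are illustrative, and the loop is a simplification of the Pregel-style model, not the GraphX or graphframes API.

```python
from collections import defaultdict

# (id, parent_id, name) rows; the sample mirrors the question's data.
rows = [(3, 2, 'Child'), (2, 1, 'Parent'), (1, None, 'Root')]

children = defaultdict(list)  # parent id -> list of child ids
state = {}                    # vertex id -> current value
for node_id, parent_id, name in rows:
    children[parent_id].append(node_id)
    state[node_id] = name

# Superstep 0: the roots are the only active vertices.
active = list(children.get(None, []))
supersteps = 0
while active:
    # Each active vertex sends its current value to its children.
    messages = {child: state[v] for v in active for child in children.get(v, [])}
    # Each receiving vertex appends the message to its own value.
    for node_id, parent_path in messages.items():
        state[node_id] = state[node_id] + ' ' + parent_path
    # Vertices that received a message become active in the next superstep.
    active = list(messages)
    supersteps += 1

print(state)
```

The number of supersteps is bounded by the depth of the tree, which the question caps at 256 levels.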

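I have included logic here that uses joins to mimic vertex-centric programming without using any library, provided you can convert your data representation into a dataframe with the columns id, parent_id and name.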

from pyspark.sql import functions as F
from pyspark.sql.functions import col as c
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("nsankaranara").getOrCreate() 

data = [(6, 3, "Child_3", ),
        (5, 2, "Child_2", ),
        (4, 2, "Child_1", ),
        (3, 1, "Parent_2", ), 
        (2, 1, "Parent_1", ), 
        (1, None, "Root"), ]


df = spark.createDataFrame(data, ("id", "parent_id", "name", ))

df = (df.withColumn("mapped_parent_id", c("parent_id"))
        .withColumn("visited", F.lit(False)))

start_nodes = df.filter(c("mapped_parent_id").isNull())

# Controls how deep we want to traverse
max_iter = 256
iter_counter = 0

# Iteratively find the children of the current start nodes and append the parent's name to theirs
while iter_counter < max_iter:
    iter_counter += 1
    df = (df.alias("a").join(start_nodes.alias("b"), 
                            ((c("a.parent_id") == c("b.id")) | (c("a.id") == c("b.id"))), how="left_outer"))
    df = (df.select(c("a.id"), 
                  c("a.parent_id"),
                  (F.when((c("b.id").isNotNull() & (c("a.id") != c("b.id"))), F.lit(None)).otherwise(c("a.mapped_parent_id"))).alias("mapped_parent_id"),
                  F.when((c("a.id") != c("b.id")), F.concat_ws(" ", c("a.name"), c("b.name"))).otherwise(c("a.name")).alias("name"),
                  (F.when(c("a.id") == c("b.id"), F.lit(True)).otherwise(c("a.visited"))).alias("visited")
                  ))
    start_nodes = df.filter(((c("mapped_parent_id").isNull()) & (c("visited") == False)))
    if start_nodes.count() == 0:
        # signifies that all nodes have been visited
        break

df.select("id", "parent_id", "name").show(truncate = False)

Output

+---+---------+---------------------+
|id |parent_id|name                 |
+---+---------+---------------------+
|6  |3        |Child_3 Parent_2 Root|
|5  |2        |Child_2 Parent_1 Root|
|4  |2        |Child_1 Parent_1 Root|
|3  |1        |Parent_2 Root        |
|2  |1        |Parent_1 Root        |
|1  |null     |Root                 |
+---+---------+---------------------+
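The DataFrame above expects flat rows with id, parent_id and name, while the question's data nests the parent and the name in a tuple. A minimal sketch of that conversion in plain Python follows; the helper name `to_rows` is an assumption made for illustration.

```python
# Nested tuples as given in the question: (id, (parent_id, name)).
nested = [(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))]

def to_rows(items):
    """Flatten (id, (parent_id, name)) into (id, parent_id, name) tuples."""
    return [(node_id, parent_id, name) for node_id, (parent_id, name) in items]

rows = to_rows(nested)
print(rows)  # [(3, 2, 'Child'), (2, 1, 'Parent'), (1, None, 'Root')]
```

The resulting list could be passed to spark.createDataFrame(rows, ("id", "parent_id", "name")) as in the code above. For the full 180 million items, the flattening would probably be better done inside Spark while reading the file than in a Python list.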

Answer 1
1 2021-12-14 04:18:56

Answer 2
0 2021-12-14 10:11:51
