Most efficient way to create a tree-like structure from a large JSON file
I have a large JSON file like the following:
[(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))]
where the key of each element is a unique index for that element and the 1st element in the value pair signifies the index of its parent element.
Now, the ultimate goal is to convert this JSON file into the following:
[(3, (2, 'Child Parent Root')), (2, (1, 'Parent Root')), (1, (None, 'Root'))]
where the 2nd element in the value pair for each item is replaced by the concatenation of its own value with the values of all its ancestors up to the root.
The number of levels is not fixed and can be up to 256. I know I could solve this by building a tree data structure and traversing it, but the problem is that the JSON file is huge (almost 180M items in the list).
Any idea how I can achieve this efficiently? Suggestions involving Apache Spark would be fine as well.
You can use a breadth-first search to find all ancestor element chains:
from collections import deque, defaultdict

d = [(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))]

# Group children under their parent id: parent_id -> [(child_id, child_name), ...]
d1 = defaultdict(list)
for a, (b, c) in d:
    d1[b].append((a, c))

# BFS from the root(s), carrying the concatenated ancestor chain along each path
q, r = deque(d1[None]), {}
while q:
    i, chain = q.popleft()
    r[i] = chain
    q.extend((a, b + ' ' + chain) for a, b in d1[i])
Now, r stores the ancestor values for each element:
{1: 'Root', 2: 'Parent Root', 3: 'Child Parent Root'}
Then, use a list comprehension to update d:
result = [(a, (b, r[a])) for a, (b, _) in d]
Output:
[(3, (2, 'Child Parent Root')), (2, (1, 'Parent Root')), (1, (None, 'Root'))]
An iterative approach such as BFS eliminates the possibility of a RecursionError, which might occur when running a recursive DFS on a very large graph.
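If you would rather not build the child index first, another iterative option is to walk each node's parent chain once and memoize the resolved chains, so shared ancestors are only concatenated once. A minimal sketch of that idea (the helper name resolve_chains is made up for illustration):

```python
def resolve_chains(items):
    """Resolve each node's full ancestor chain iteratively, with memoization.

    items: list of (id, (parent_id, name)) pairs, as in the question.
    Returns the same structure with names replaced by the concatenated chain.
    """
    parent = {i: p for i, (p, _) in items}
    name = {i: nm for i, (_, nm) in items}
    memo = {}  # id -> 'name parent_name ... root_name'

    for i in parent:
        # Walk upward until we hit the root (None) or an already-resolved node
        stack = []
        j = i
        while j is not None and j not in memo:
            stack.append(j)
            j = parent[j]
        acc = memo[j] if j is not None else None
        # Unwind the path, filling the memo on the way back down
        while stack:
            j = stack.pop()
            acc = name[j] if acc is None else name[j] + ' ' + acc
            memo[j] = acc

    return [(i, (p, memo[i])) for i, (p, _) in items]
```

Each node is visited a constant number of times, so this stays linear in the number of items regardless of depth (bounded at 256 here anyway), and it never recurses.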
In Spark this is treated as a Graph problem and solved using Vertex Centric Programming. Unfortunately, GraphX does not have a Python-compatible API. Another option is to use graphframes.
I have included here a logic using joins which mimics Vertex Centric Programming, but without using any libraries. It works provided you can convert the data representation you have into a dataframe with id, parent_id and name columns.
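Converting the question's (id, (parent_id, name)) pairs into those three columns is a simple flattening; a minimal sketch (to_rows is a hypothetical helper name):

```python
def to_rows(items):
    # Flatten (id, (parent_id, name)) pairs into (id, parent_id, name) rows
    return [(i, p, nm) for i, (p, nm) in items]

rows = to_rows([(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))])
# rows can then be passed to spark.createDataFrame(rows, ("id", "parent_id", "name"))
```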
from pyspark.sql import functions as F
from pyspark.sql.functions import col as c
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("nsankaranara").getOrCreate()

data = [(6, 3, "Child_3"),
        (5, 2, "Child_2"),
        (4, 2, "Child_1"),
        (3, 1, "Parent_2"),
        (2, 1, "Parent_1"),
        (1, None, "Root")]

df = spark.createDataFrame(data, ("id", "parent_id", "name"))
df = (df.withColumn("mapped_parent_id", c("parent_id"))
        .withColumn("visited", F.lit(False)))

start_nodes = df.filter(c("mapped_parent_id").isNull())

# Controls how deep we want to traverse
max_iter = 256
iter_counter = 0

# Iteratively identify the next child and add your name to them
while iter_counter < max_iter:
    iter_counter += 1
    df = (df.alias("a").join(start_nodes.alias("b"),
          ((c("a.parent_id") == c("b.id")) | (c("a.id") == c("b.id"))), how="left_outer"))
    df = df.select(
        c("a.id"),
        c("a.parent_id"),
        F.when((c("b.id").isNotNull() & (c("a.id") != c("b.id"))), F.lit(None))
         .otherwise(c("a.mapped_parent_id")).alias("mapped_parent_id"),
        F.when(c("a.id") != c("b.id"), F.concat_ws(" ", c("a.name"), c("b.name")))
         .otherwise(c("a.name")).alias("name"),
        F.when(c("a.id") == c("b.id"), F.lit(True))
         .otherwise(c("a.visited")).alias("visited"),
    )
    start_nodes = df.filter((c("mapped_parent_id").isNull()) & (c("visited") == False))
    if start_nodes.count() == 0:
        # Signifies that all nodes have been visited
        break

df.select("id", "parent_id", "name").show(truncate=False)
+---+---------+---------------------+
|id |parent_id|name |
+---+---------+---------------------+
|6 |3 |Child_3 Parent_2 Root|
|5 |2 |Child_2 Parent_1 Root|
|4 |2 |Child_1 Parent_1 Root|
|3 |1 |Parent_2 Root |
|2 |1 |Parent_1 Root |
|1 |null |Root |
+---+---------+---------------------+
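Note that each join advances the frontier by one level, so a 256-level tree can take up to 256 passes over 180M rows. A pointer-doubling (path-doubling) scheme cuts that to about log2(256) = 8 rounds, at the cost of touching every row each round. Here is a plain-Python sketch of the idea (the same joins could express it in Spark; resolve_by_doubling is a made-up name):

```python
def resolve_by_doubling(items):
    """Pointer doubling: each round, every node jumps to its jump-target's
    target and merges the two partial chains, so hop length doubles per round."""
    up = {i: p for i, (p, _) in items}        # current jump target per node
    chain = {i: nm for i, (_, nm) in items}   # names covered by that jump
    changed = True
    while changed:
        changed = False
        new_up, new_chain = {}, {}
        for i, p in up.items():
            if p is None:
                # Already resolved all the way to the root
                new_up[i], new_chain[i] = None, chain[i]
            else:
                # Splice this node's partial chain with its jump target's chain
                new_up[i] = up[p]
                new_chain[i] = chain[i] + ' ' + chain[p]
                changed = True
        up, chain = new_up, new_chain
    return [(i, (p, chain[i])) for i, (p, _) in items]
```

The invariant is that chain[i] always holds the names on the path from i up to (but excluding) up[i], which is exactly what the concatenation in each round preserves.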