Most efficient way to create a tree-like structure from a large JSON file
I have a large JSON file like the following:
[(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))]
where the key of each element is a unique index for that element and the 1st element in the value pair signifies the index of its parent element.
Now, the ultimate goal is to convert this JSON file into the following:
[(3, (2, 'Child Parent Root')), (2, (1, 'Parent Root')), (1, (None, 'Root'))]
where the 2nd element in the value pair for each item is replaced by the concatenation of its own value with the values of all its ancestors up to the root.
The number of levels is not fixed and can be up to 256. I know I could solve this by building a tree data structure and traversing it, but the problem is that the JSON file is huge (almost 180M items in the list).
Any idea how I can achieve this efficiently? Suggestions involving Apache Spark would be fine as well.
You can use a breadth-first search to find all ancestor element chains:
from collections import deque, defaultdict

d = [(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))]

# Group children under their parent id: parent_id -> [(child_id, child_name), ...]
d1 = defaultdict(list)
for a, (b, c) in d:
    d1[b].append((a, c))

# BFS from the root(s), carrying the concatenated ancestor chain along each path
q, r = deque(d1[None]), {}
while q:
    i, chain = q.popleft()
    r[i] = chain
    q.extend((a, b + ' ' + chain) for a, b in d1[i])
Now, r stores the ancestor values for each element:
{1: 'Root', 2: 'Parent Root', 3: 'Child Parent Root'}
Then, use a list comprehension to update d:
result = [(a, (b, r[a])) for a, (b, _) in d]
Output:
[(3, (2, 'Child Parent Root')), (2, (1, 'Parent Root')), (1, (None, 'Root'))]
An iterative approach such as BFS eliminates the possibility of a RecursionError, which might occur when running a recursive DFS on a very large graph.
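If you would rather not build the child index first, another iterative option is to walk each node's parent chain once and memoize the resolved chains, so shared ancestors are only concatenated once. A minimal sketch of that idea (the helper name resolve_chains is made up for illustration):

```python
def resolve_chains(items):
    """Resolve each node's full ancestor chain iteratively, with memoization.

    items: list of (id, (parent_id, name)) pairs, as in the question.
    Returns the same structure with names replaced by the concatenated chain.
    """
    parent = {i: p for i, (p, _) in items}
    name = {i: nm for i, (_, nm) in items}
    memo = {}  # id -> 'name parent_name ... root_name'

    for i in parent:
        # Walk upward until we hit the root (None) or an already-resolved node
        stack = []
        j = i
        while j is not None and j not in memo:
            stack.append(j)
            j = parent[j]
        acc = memo[j] if j is not None else None
        # Unwind the path, filling the memo on the way back down
        while stack:
            j = stack.pop()
            acc = name[j] if acc is None else name[j] + ' ' + acc
            memo[j] = acc

    return [(i, (p, memo[i])) for i, (p, _) in items]
```

Each node is visited a constant number of times, so this stays linear in the number of items regardless of depth (bounded at 256 here anyway), and it never recurses.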
In Spark this is treated as a Graph problem and solved using Vertex Centric Programming. Unfortunately, GraphX does not have a Python-compatible API. Another option is to use graphframes.
I have included here a logic using joins which mimics Vertex Centric Programming, but without using any libraries. It works provided you can convert the data representation you have into a dataframe with id, parent_id and name columns.
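Converting the question's (id, (parent_id, name)) pairs into those three columns is a simple flattening; a minimal sketch (to_rows is a hypothetical helper name):

```python
def to_rows(items):
    # Flatten (id, (parent_id, name)) pairs into (id, parent_id, name) rows
    return [(i, p, nm) for i, (p, nm) in items]

rows = to_rows([(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))])
# rows can then be passed to spark.createDataFrame(rows, ("id", "parent_id", "name"))
```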
from pyspark.sql import functions as F
from pyspark.sql.functions import col as c
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("nsankaranara").getOrCreate()

data = [(6, 3, "Child_3"),
        (5, 2, "Child_2"),
        (4, 2, "Child_1"),
        (3, 1, "Parent_2"),
        (2, 1, "Parent_1"),
        (1, None, "Root")]

df = spark.createDataFrame(data, ("id", "parent_id", "name"))
df = (df.withColumn("mapped_parent_id", c("parent_id"))
        .withColumn("visited", F.lit(False)))

start_nodes = df.filter(c("mapped_parent_id").isNull())

# Controls how deep we want to traverse
max_iter = 256
iter_counter = 0

# Iteratively identify the next child and add your name to them
while iter_counter < max_iter:
    iter_counter += 1
    df = (df.alias("a").join(start_nodes.alias("b"),
          ((c("a.parent_id") == c("b.id")) | (c("a.id") == c("b.id"))), how="left_outer"))
    df = df.select(
        c("a.id"),
        c("a.parent_id"),
        F.when((c("b.id").isNotNull() & (c("a.id") != c("b.id"))), F.lit(None))
         .otherwise(c("a.mapped_parent_id")).alias("mapped_parent_id"),
        F.when(c("a.id") != c("b.id"), F.concat_ws(" ", c("a.name"), c("b.name")))
         .otherwise(c("a.name")).alias("name"),
        F.when(c("a.id") == c("b.id"), F.lit(True))
         .otherwise(c("a.visited")).alias("visited"),
    )
    start_nodes = df.filter((c("mapped_parent_id").isNull()) & (c("visited") == False))
    if start_nodes.count() == 0:
        # Signifies that all nodes have been visited
        break

df.select("id", "parent_id", "name").show(truncate=False)
+---+---------+---------------------+
|id |parent_id|name |
+---+---------+---------------------+
|6 |3 |Child_3 Parent_2 Root|
|5 |2 |Child_2 Parent_1 Root|
|4 |2 |Child_1 Parent_1 Root|
|3 |1 |Parent_2 Root |
|2 |1 |Parent_1 Root |
|1 |null |Root |
+---+---------+---------------------+
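Note that each join advances the frontier by one level, so a 256-level tree can take up to 256 passes over 180M rows. A pointer-doubling (path-doubling) scheme cuts that to about log2(256) = 8 rounds, at the cost of touching every row each round. Here is a plain-Python sketch of the idea (the same joins could express it in Spark; resolve_by_doubling is a made-up name):

```python
def resolve_by_doubling(items):
    """Pointer doubling: each round, every node jumps to its jump-target's
    target and merges the two partial chains, so hop length doubles per round."""
    up = {i: p for i, (p, _) in items}        # current jump target per node
    chain = {i: nm for i, (_, nm) in items}   # names covered by that jump
    changed = True
    while changed:
        changed = False
        new_up, new_chain = {}, {}
        for i, p in up.items():
            if p is None:
                # Already resolved all the way to the root
                new_up[i], new_chain[i] = None, chain[i]
            else:
                # Splice this node's partial chain with its jump target's chain
                new_up[i] = up[p]
                new_chain[i] = chain[i] + ' ' + chain[p]
                changed = True
        up, chain = new_up, new_chain
    return [(i, (p, chain[i])) for i, (p, _) in items]
```

The invariant is that chain[i] always holds the names on the path from i up to (but excluding) up[i], which is exactly what the concatenation in each round preserves.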