Most efficient way to create a tree like structure from a large JSON file

I have a large JSON file like the following:

[(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))]

where the key of each element is a unique index for that element, and the first element in the value pair signifies the index of its parent element.

Now, the ultimate goal is to convert this JSON file into the following:

[(3, (2, 'Child Parent Root')), (2, (1, 'Parent Root')), (1, (None, 'Root'))]

where the second element in the value pair for each item will be modified so that it holds the concatenation of all the values up to its root ancestor.

The number of levels is not fixed and can be up to 256. I know I can solve this problem by creating a tree data structure and traversing it, but the problem is that the JSON file is huge (almost 180M items in the list).

Any idea on how I can achieve this efficiently? Suggestions involving Apache Spark would be fine as well.

You can use a breadth-first search to find all ancestor element chains:

from collections import deque, defaultdict

d = [(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))]

# Group children by parent id: parent_id -> [(child_id, child_name), ...]
d1 = defaultdict(list)
for a, (b, c) in d:
    d1[b].append((a, c))

# BFS from the root, pushing the accumulated ancestor string down to each child
q, r = deque([(1, d1[None][0][1])]), {}
while q:
    i, name = q.popleft()
    r[i] = name
    q.extend([(a, b + ' ' + name) for a, b in d1[i]])

Now, r stores the ancestor values for each element:

{1: 'Root', 2: 'Parent Root', 3: 'Child Parent Root'}

Then, use a list comprehension to update d:

result = [(a, (b, r[a])) for a, (b, _) in d]

Output:

[(3, (2, 'Child Parent Root')), (2, (1, 'Parent Root')), (1, (None, 'Root'))]

An iterative approach such as BFS eliminates the possibility of a RecursionError, which might occur when running DFS on a very large graph.
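For contrast, here is a minimal recursive (DFS-style) sketch of the same resolution; the names resolve, parent_of and name_of are hypothetical helpers introduced for illustration and are not part of the answer above:

# Hypothetical lookup dicts built from the input, e.g.
# parent_of = {3: 2, 2: 1, 1: None}, name_of = {3: 'Child', 2: 'Parent', 1: 'Root'}
def resolve(node_id, parent_of, name_of):
    parent = parent_of[node_id]
    if parent is None:
        return name_of[node_id]
    # Recurse towards the root, concatenating names along the way
    return name_of[node_id] + ' ' + resolve(parent, parent_of, name_of)

# Usage: {i: resolve(i, parent_of, name_of) for i in parent_of}

With chains of up to 256 levels this stays well under Python's default recursion limit (typically 1000), but substantially deeper data would raise a RecursionError, which the iterative BFS avoids.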

In Spark this is treated as a graph problem and solved using Vertex Centric Programming. Unfortunately, GraphX does not have a Python-compatible API.

Another option is to use graphframes.
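As a rough illustration of that route, a minimal sketch of building a GraphFrame from the same data is shown below; it assumes graphframes is installed and a SparkSession named spark exists (like the one created in the join-based code further down), and it only constructs the graph - the ancestor concatenation itself would still need message passing (e.g. aggregateMessages) on top of it:

from graphframes import GraphFrame

# Vertices need an "id" column; edges need "src" and "dst" columns.
vertices = spark.createDataFrame(
    [(1, "Root"), (2, "Parent"), (3, "Child")], ["id", "name"])
# Point each edge from a child to its parent.
edges = spark.createDataFrame([(3, 2), (2, 1)], ["src", "dst"])

g = GraphFrame(vertices, edges)
g.vertices.show()
g.edges.show()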

I have included here a logic using joins which mimics Vertex Centric Programming but without using any libraries, provided you can convert the data representation you have into a DataFrame with id, parent_id and name columns.

from pyspark.sql import functions as F
from pyspark.sql.functions import col as c
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("nsankaranara").getOrCreate() 

data = [(6, 3, "Child_3", ),
        (5, 2, "Child_2", ),
        (4, 2, "Child_1", ),
        (3, 1, "Parent_2", ), 
        (2, 1, "Parent_1", ), 
        (1, None, "Root"), ]


df = spark.createDataFrame(data, ("id", "parent_id", "name", ))

df = (df.withColumn("mapped_parent_id", c("parent_id"))
        .withColumn("visited", F.lit(False)))

start_nodes = df.filter(c("mapped_parent_id").isNull())

# Controls how deep we want to traverse
max_iter = 256
iter_counter = 0

# Iteratively visit the next level of children and append their parent's accumulated name
while iter_counter < max_iter:
    iter_counter += 1
    df = (df.alias("a").join(start_nodes.alias("b"), 
                            ((c("a.parent_id") == c("b.id")) | (c("a.id") == c("b.id"))), how="left_outer"))
    df = (df.select(c("a.id"), 
                  c("a.parent_id"),
                  (F.when((c("b.id").isNotNull() & (c("a.id") != c("b.id"))), F.lit(None)).otherwise(c("a.mapped_parent_id"))).alias("mapped_parent_id"),
                  F.when((c("a.id") != c("b.id")), F.concat_ws(" ", c("a.name"), c("b.name"))).otherwise(c("a.name")).alias("name"),
                  (F.when(c("a.id") == c("b.id"), F.lit(True)).otherwise(c("a.visited"))).alias("visited")
                  ))
    start_nodes = df.filter(((c("mapped_parent_id").isNull()) & (c("visited") == False)))
    if start_nodes.count() == 0:
        # signifies that all nodes have been visited
        break

df.select("id", "parent_id", "name").show(truncate = False)

Output

+---+---------+---------------------+
|id |parent_id|name                 |
+---+---------+---------------------+
|6  |3        |Child_3 Parent_2 Root|
|5  |2        |Child_2 Parent_1 Root|
|4  |2        |Child_1 Parent_1 Root|
|3  |1        |Parent_2 Root        |
|2  |1        |Parent_1 Root        |
|1  |null     |Root                 |
+---+---------+---------------------+
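
As a side note, converting the list-of-tuples representation from the question into the (id, parent_id, name) DataFrame this logic expects could look like the following sketch (the variable raw is a hypothetical list holding the data already loaded from the file):

# raw holds the (id, (parent_id, name)) pairs shown in the question
raw = [(3, (2, 'Child')), (2, (1, 'Parent')), (1, (None, 'Root'))]
rows = [(i, parent, name) for i, (parent, name) in raw]
df = spark.createDataFrame(rows, ("id", "parent_id", "name"))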
