
Export data from Pyspark Dataframe into Dictionary or List for further processing in Python

I am trying to retrieve values from a PySpark DataFrame after PySpark has done the work of finding connected components, but I don't understand how to extract that data the way you would from, say, a list.

Below is a simplified version of the table created from the large dataset I'm working with. Essentially, the table is produced from connectivity data on the vertices and edges of graphs. If the component number is the same, it means that the nodes (ids) lie in the same graph structure.


    +---+------------+
    | id|   component|
    +---+------------+
    |  0|154618822656|
    |  1|154618822656|
    |  2|154618822656|
    |  3|154618822656|
    |  4|420906795008|
    |  5|420906795008|
    +---+------------+

I've tried a lot of things to extract the data into the forms I'm most used to, like lists and dictionaries. When I try various methods from the docs, I get outputs like:

[Row(id='0', component=154618822656), Row(id='1', component=154618822656)]

which I'm not sure how to work with. I've also seen an asDict() method in PySpark, but I cannot get it to work on even a simple table.

This is an example function that takes the graphframe, finds the connected components, and creates a table. All is well until I want to put the data into another structure:

def get_connected_components(graphframe):
    # use the function argument, not a global `g`
    connected_table = graphframe.connectedComponents()
    # take(2) pulls the first two rows back to the driver as Row objects
    conn = connected_table.rdd.take(2)
    print(conn)

I'd ultimately like to have something like this:

{"154618822656" : {0, 1}, "420906795008": {2, 3, 4, 5}}

which I would turn into a further output like:

0 1 2 3
4 5

This may be the wrong way to go about working with these tables, but I'm brand new to PySpark and surprised at how tricky this is even after all my searching. Thank you in advance.

Not entirely sure what you are trying to do, but here are some methods for dictionary and list conversion via Spark that should help. One very important thing to note: if you want to use structures like lists/dicts, then I suggest working on a single machine (if your data set can fit into memory) rather than distributing computation via Spark only to collect all the data back to a single machine for more processing. Since you are working with Spark GraphFrames, there are some nice single-machine Python graph packages too (a small sketch with one of them follows). Hope this helps.
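For instance, networkx (one such single-machine package) can find connected components directly from an in-memory edge list. A minimal sketch; the edge list here is hypothetical, standing in for whatever connectivity data you actually have:

import networkx as nx

# hypothetical edge list: each pair means the two nodes are connected
edges = [(0, 1), (1, 2), (2, 3), (4, 5)]

G = nx.Graph()
G.add_edges_from(edges)

# connected_components() yields one set of node ids per component
for component in nx.connected_components(G):
    print(component)

# {0, 1, 2, 3}
# {4, 5}

That said, here are the Spark-side conversions: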

# load your sample data set
data = [(0, 154618822656),
        (1, 154618822656),
        (2, 154618822656),
        (3, 154618822656),
        (4, 420906795008),
        (5, 420906795008)]

df = spark.createDataFrame(data, ("id", "comp"))

df.show()

+---+------------+
| id|        comp|
+---+------------+
|  0|154618822656|
|  1|154618822656|
|  2|154618822656|
|  3|154618822656|
|  4|420906795008|
|  5|420906795008|
+---+------------+

# get the desired format like {"154618822656": {0, 1, 2, 3}, "420906795008": {4, 5}} from your post
from pyspark.sql.functions import collect_list

df.groupBy("comp").agg(collect_list("id").alias("id")).show()
+------------+------------+
|        comp|          id|
+------------+------------+
|154618822656|[0, 1, 2, 3]|
|420906795008|      [4, 5]|
+------------+------------+
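To turn that grouped result into the exact dictionary-of-sets from the question, you can collect() the grouped rows back to the driver. A minimal sketch (again, only sensible when the grouped data fits in memory):

grouped = df.groupBy("comp").agg(collect_list("id").alias("id"))

# collect() returns a list of Row objects; fields are accessible by key,
# or via row.asDict() if you prefer a plain dict per row
comp_to_ids = {str(row["comp"]): set(row["id"]) for row in grouped.collect()}

print(comp_to_ids)
# {'154618822656': {0, 1, 2, 3}, '420906795008': {4, 5}}

# and the further output from the question, one component per line
for ids in comp_to_ids.values():
    print(" ".join(str(i) for i in sorted(ids)))
# 0 1 2 3
# 4 5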

# you can convert a column to a list ***collect() is not recommended for larger datasets***
l = df.select("id").rdd.flatMap(lambda x: x).collect()

print(type(l))
print(l)
<class 'list'>
[0, 1, 2, 3, 4, 5]
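Similarly, if what you need is a plain id -> component dictionary, collectAsMap() on a pair RDD does the conversion in one step (same caveat about dataset size):

# build (key, value) pairs, then collect them into a Python dict
id_to_comp = df.rdd.map(lambda row: (row["id"], row["comp"])).collectAsMap()

print(id_to_comp)
# {0: 154618822656, 1: 154618822656, 2: 154618822656, 3: 154618822656, 4: 420906795008, 5: 420906795008}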

# write to json so you can get a dictionary format like you were mentioning
df.groupBy("comp").agg(collect_list("id").alias("id")).write.json("data.json")

! cat data.json/*.json
{"comp":154618822656,"id":[0,1,2,3]}
{"comp":420906795008,"id":[4,5]}
