
Export data from Pyspark Dataframe into Dictionary or List for further processing in Python

I am trying to retrieve values from a PySpark DataFrame after PySpark has done the work of finding connected components, but I don't understand how to extract that data the way you would from, say, a list.

Below is a simplified version of the table created from the large dataset I'm working with. Essentially, the table is produced from connectivity data on the vertices and edges of graphs. If the component number is the same, it means that the nodes (ids) lie in the same graph structure.


    +---+------------+
    | id|   component|
    +---+------------+
    |  0|154618822656|
    |  1|154618822656|
    |  2|154618822656|
    |  3|154618822656|
    |  4|420906795008|
    |  5|420906795008|
    +---+------------+

I've tried a lot of things to extract the data into the forms I'm most used to, like lists and dictionaries. When I try various methods from the docs, I get outputs like:

[Row(id='0', component=154618822656), Row(id='1', component=154618822656)]

which I'm not sure how to work with. I've also seen an asDict() method in PySpark, but I cannot get it to work on even a simple table.

This is an example function that takes the graphframe, finds the connected components, and creates a table. All is well until I want to put the data into another structure:

def get_connected_components(graphframe):
    # use the function argument, not a global `g`
    connected_table = graphframe.connectedComponents()
    # take(2) pulls the first two rows back to the driver as Row objects
    conn = connected_table.rdd.take(2)
    print(conn)

I'd ultimately like to have something like this:

{"154618822656" : {0, 1}, "420906795008": {2, 3, 4, 5}}

which I would turn into a further output like:

0 1 2 3
4 5

This may be the wrong way to go about working with these tables, but I'm brand new to PySpark and surprised at how tricky this is even after all my searching. Thank you in advance.

Not entirely sure what you are trying to do, but here are some methods for dictionary and list conversion via Spark that should help. One very important thing to note: if you want to use structures like lists/dicts, then I suggest working on a single machine (if your data set can fit into memory) rather than distributing computation via Spark only to collect all the data back to a single machine for more processing. Since you are working with Spark GraphFrames, there are some nice single-machine Python graph packages too (a small sketch with one of them follows). Hope this helps.
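For instance, networkx (one such single-machine package) can find connected components directly from an in-memory edge list. A minimal sketch; the edge list here is hypothetical, standing in for whatever connectivity data you actually have:

import networkx as nx

# hypothetical edge list: each pair means the two nodes are connected
edges = [(0, 1), (1, 2), (2, 3), (4, 5)]

G = nx.Graph()
G.add_edges_from(edges)

# connected_components() yields one set of node ids per component
for component in nx.connected_components(G):
    print(component)

# {0, 1, 2, 3}
# {4, 5}

That said, here are the Spark-side conversions: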

# load your sample data set
data = [(0, 154618822656),
        (1, 154618822656),
        (2, 154618822656),
        (3, 154618822656),
        (4, 420906795008),
        (5, 420906795008)]

df = spark.createDataFrame(data, ("id", "comp"))

df.show()

+---+------------+
| id|        comp|
+---+------------+
|  0|154618822656|
|  1|154618822656|
|  2|154618822656|
|  3|154618822656|
|  4|420906795008|
|  5|420906795008|
+---+------------+

# get the desired format like {"154618822656": {0, 1, 2, 3}, "420906795008": {4, 5}} from your post
from pyspark.sql.functions import collect_list

df.groupBy("comp").agg(collect_list("id").alias("id")).show()
+------------+------------+
|        comp|          id|
+------------+------------+
|154618822656|[0, 1, 2, 3]|
|420906795008|      [4, 5]|
+------------+------------+
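To turn that grouped result into the exact dictionary-of-sets from the question, you can collect() the grouped rows back to the driver. A minimal sketch (again, only sensible when the grouped data fits in memory):

grouped = df.groupBy("comp").agg(collect_list("id").alias("id"))

# collect() returns a list of Row objects; fields are accessible by key,
# or via row.asDict() if you prefer a plain dict per row
comp_to_ids = {str(row["comp"]): set(row["id"]) for row in grouped.collect()}

print(comp_to_ids)
# {'154618822656': {0, 1, 2, 3}, '420906795008': {4, 5}}

# and the further output from the question, one component per line
for ids in comp_to_ids.values():
    print(" ".join(str(i) for i in sorted(ids)))
# 0 1 2 3
# 4 5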

# you can convert a column to a list ***collect() is not recommended for larger datasets***
l = df.select("id").rdd.flatMap(lambda x: x).collect()

print(type(l))
print(l)
<class 'list'>
[0, 1, 2, 3, 4, 5]
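Similarly, if what you need is a plain id -> component dictionary, collectAsMap() on a pair RDD does the conversion in one step (same caveat about dataset size):

# build (key, value) pairs, then collect them into a Python dict
id_to_comp = df.rdd.map(lambda row: (row["id"], row["comp"])).collectAsMap()

print(id_to_comp)
# {0: 154618822656, 1: 154618822656, 2: 154618822656, 3: 154618822656, 4: 420906795008, 5: 420906795008}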

# write to json so you can get a dictionary format like you were mentioning
df.groupBy("comp").agg(collect_list("id").alias("id")).write.json("data.json")

! cat data.json/*.json
{"comp":154618822656,"id":[0,1,2,3]}
{"comp":420906795008,"id":[4,5]}
