
Export data from a PySpark DataFrame into a dictionary or list for further processing in Python

I am trying to retrieve values from a PySpark DataFrame after PySpark has done the work of finding connected components, but I don't understand how to extract that data the way you would from, say, a list.

Below is a simplified version of the table created from the large dataset I'm working with. Essentially, the following table is created by running connected components on the vertices and edges of a graph. If the component number is the same, it means the nodes (ids) lie in the same graph structure.


    +---+------------+
    | id|   component|
    +---+------------+
    |  0|154618822656|
    |  1|154618822656|
    |  2|154618822656|
    |  3|154618822656|
    |  4|420906795008|
    |  5|420906795008|
    +---+------------+

I've tried a lot of things to extract the data into the forms I'm most used to, like lists and dictionaries. When I try various methods from the docs, I get outputs like:

[Row(id='0', component=154618822656), Row(id='1', component=154618822656)]

which I'm not sure how to work with. I've also seen an asDict() method in PySpark, but I cannot get it to work even on a simple table.

This is an example function that takes the graphframe, finds the connected components and creates a table. All is well until I want to put the data in another structure:

def get_connected_components(graphframe):
    # run GraphFrames' connected components; returns a DataFrame of (id, component)
    connected_table = graphframe.connectedComponents()
    # bring the first two rows back to the driver as a list of Row objects
    conn = connected_table.rdd.take(2)
    print(conn)

I'd ultimately like to have something like this:

{"154618822656" : {0, 1}, "420906795008": {2, 3, 4, 5}}

which I would turn into a further output like:

0 1 2 3
4 5

This may be the wrong way to go about working with these tables, but I'm brand new to PySpark and surprised at how tricky this is even after all my searching. Thank you in advance.

Not entirely sure what you are trying to do, but here are some methods for dictionary and list conversion with Spark that should help. One very important thing to note: if you want to work with structures like lists and dicts, I suggest doing that on a single machine (if your dataset fits into memory) rather than distributing computation via Spark only to collect all the data back to a single machine for more processing. There are also some nice single-machine Python graph packages, since you are working with Spark GraphFrames. Hope this helps.
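On the asDict() point: asDict() is a method on individual Row objects rather than on the DataFrame itself, which may be why it seemed not to work. A minimal sketch (using the sample df built below):

# asDict() turns a single Row into a plain Python dict
rows = df.take(2)
print([row.asDict() for row in rows])
# [{'id': 0, 'comp': 154618822656}, {'id': 1, 'comp': 154618822656}]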

# load your sample data set
data = [(0, 154618822656),
        (1, 154618822656),
        (2, 154618822656),
        (3, 154618822656),
        (4, 420906795008),
        (5, 420906795008)]

df = spark.createDataFrame(data, ("id", "comp"))

df.show()

+---+------------+
| id|        comp|
+---+------------+
|  0|154618822656|
|  1|154618822656|
|  2|154618822656|
|  3|154618822656|
|  4|420906795008|
|  5|420906795008|
+---+------------+

# get desired format like {"154618822656" : {0, 1}, "420906795008": {2, 3, 4, 5}} from your post
from pyspark.sql.functions import collect_list

df.groupBy("comp").agg(collect_list("id").alias("id")).show()
+------------+------------+
|        comp|          id|
+------------+------------+
|154618822656|[0, 1, 2, 3]|
|420906795008|      [4, 5]|
+------------+------------+
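If you want that exact dictionary (and the line-per-component output) on the driver, one option, assuming the grouped result is small enough to collect, is a sketch like this (grouped and comp_to_ids are just illustrative names):

grouped = df.groupBy("comp").agg(collect_list("id").alias("id"))

# build {"154618822656": {0, 1, 2, 3}, "420906795008": {4, 5}}
comp_to_ids = {str(row["comp"]): set(row["id"]) for row in grouped.collect()}
print(comp_to_ids)
# {'154618822656': {0, 1, 2, 3}, '420906795008': {4, 5}}

# and the line-per-component output from the question
for comp in sorted(comp_to_ids):
    print(" ".join(str(i) for i in sorted(comp_to_ids[comp])))
# 0 1 2 3
# 4 5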

# you can convert a column to a list ***collect() is not recommended for larger datasets***
l = df.select("id").rdd.flatMap(lambda x: x).collect()

print(type(l))
print(l)
<class 'list'>
[0, 1, 2, 3, 4, 5]

# write to json so you can get a dictionary format like you were mentioning
df.groupBy("comp").agg(collect_list("id").alias("id")).write.json("data.json")

! cat data.json/*.json
{"comp":154618822656,"id":[0,1,2,3]}
{"comp":420906795008,"id":[4,5]}
