I have a DataFrame similar to this example:
Timestamp | Word | Count
30/12/2015 | example_1 | 3
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
27/12/2015 | example_3 | 7
... | ... | ...
and I want to split this DataFrame by the 'Word' column's values to obtain a "list" of DataFrames (to plot some figures in a next step). For example:
DF1
Timestamp | Word | Count
30/12/2015 | example_1 | 3
DF2
Timestamp | Word | Count
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
DF3
Timestamp | Word | Count
27/12/2015 | example_3 | 7
Is there a way to do this with PySpark (1.6)?
It won't be efficient, but you can map `filter` over the list of distinct values:
words = df.select("Word").distinct().flatMap(lambda x: x).collect()
dfs = [df.where(df["Word"] == word) for word in words]
In Spark 2.0 and later, `DataFrame` no longer exposes `flatMap` directly, so go through `.rdd` first:
words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
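For illustration, here is the same split-by-key idea run on the question's sample rows in plain Python (no Spark required), so you can see what the resulting list contains. The variable names mirror the snippet above; the rows are hypothetical tuples standing in for DataFrame rows:

```python
# Sample rows from the question, as (timestamp, word, count) tuples.
rows = [
    ("30/12/2015", "example_1", 3),
    ("29/12/2015", "example_2", 1),
    ("28/12/2015", "example_2", 9),
    ("27/12/2015", "example_3", 7),
]

# Equivalent of df.select("Word").distinct().collect():
# the distinct words, in first-seen order.
words = list(dict.fromkeys(row[1] for row in rows))

# Equivalent of [df.where(df["Word"] == word) for word in words]:
# one list of rows per distinct word.
dfs = [[row for row in rows if row[1] == word] for word in words]
```

Each element of `dfs` then plays the role of one of the DF1/DF2/DF3 frames from the question and can be plotted independently.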
In addition to what zero323 said, I might add
df.persist()
before creating the dfs, so the source DataFrame won't need to be recomputed from its lineage every time you run an action on each of your "dfs".
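To see why persisting helps, here is a plain-Python analogy (not the Spark API): without caching, every downstream action re-runs the source computation, while caching it once up front runs it a single time. The `compute_source` function is a hypothetical stand-in for re-evaluating the source DataFrame's lineage:

```python
calls = {"n": 0}  # counts how often the "source transformation" runs

def compute_source():
    # Stand-in for re-evaluating the source DataFrame's lineage.
    calls["n"] += 1
    return [
        ("30/12/2015", "example_1", 3),
        ("29/12/2015", "example_2", 1),
        ("28/12/2015", "example_2", 9),
        ("27/12/2015", "example_3", 7),
    ]

words = ["example_1", "example_2", "example_3"]

# Without caching: each per-word action recomputes the source.
for word in words:
    _ = [r for r in compute_source() if r[1] == word]
uncached_calls = calls["n"]  # one recomputation per word

# With caching (the effect of df.persist()): compute once, reuse.
calls["n"] = 0
cached = compute_source()
for word in words:
    _ = [r for r in cached if r[1] == word]
cached_calls = calls["n"]  # a single computation
```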