PySpark - Split/Filter DataFrame by column's values

I have a DataFrame similar to this example:

Timestamp  | Word      | Count
30/12/2015 | example_1 | 3
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
27/12/2015 | example_3 | 7
...        | ...       | ...

and I want to split this DataFrame by the 'Word' column's values to obtain a "list" of DataFrames (to plot some figures in a next step). For example:

DF1

Timestamp  | Word      | Count
30/12/2015 | example_1 | 3

DF2

Timestamp  | Word      | Count
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9

DF3

Timestamp  | Word      | Count
27/12/2015 | example_3 | 7

Is there a way to do this with PySpark (1.6)?

It won't be efficient, but you can run a separate filter for each unique value:

# Spark 1.6: DataFrame still exposes flatMap, so collect the distinct words directly
words = df.select("Word").distinct().flatMap(lambda x: x).collect()
# One filtered DataFrame per distinct word
dfs = [df.where(df["Word"] == word) for word in words]

In Spark 2.0+, DataFrame no longer exposes flatMap directly, so go through the underlying RDD:

words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
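
Since the stated next step is plotting, here is a minimal sketch of how the resulting list could be consumed. The plotting details are illustrative, not from the original answer: it assumes matplotlib is available on the driver and that each filtered slice is small enough to collect with toPandas().

import matplotlib.pyplot as plt

# Illustrative only: one line per word, assuming each slice fits in driver memory
for word, word_df in zip(words, dfs):
    pdf = word_df.toPandas()  # collect this word's rows to the driver
    plt.plot(pdf["Timestamp"], pdf["Count"], label=word)

plt.legend()
plt.show()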

In addition to what zero323 said, I might add

df.persist()

before the creation of the dfs, so the source DataFrame won't have to be recomputed every time you run an action on one of your "dfs".
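
Putting the two together, a minimal sketch of where the persist call fits, using the Spark 2.0+ variant (the show() calls here just stand in for whatever actions you actually run):

df.persist()  # cache the source once; later actions reuse it

words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
dfs = [df.where(df["Word"] == word) for word in words]

for word_df in dfs:
    word_df.show()  # each action now reads the cached df instead of recomputing it

df.unpersist()  # release the cache when done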
