PySpark - Split/Filter DataFrame by column's values

I have a DataFrame similar to this example:

Timestamp  | Word      | Count
30/12/2015 | example_1 | 3
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
27/12/2015 | example_3 | 7
...        | ...       | ...

and I want to split this DataFrame by the 'Word' column's values to obtain a "list" of DataFrames (to plot some figures in a next step). For example:

DF1

Timestamp  | Word      | Count
30/12/2015 | example_1 | 3

DF2

Timestamp  | Word      | Count
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9

DF3

Timestamp  | Word      | Count
27/12/2015 | example_3 | 7

Is there a way to do this with PySpark (1.6)?

It won't be efficient, but you can map a filter over the list of unique values:

# Collect the distinct values of the "Word" column to the driver
words = df.select("Word").distinct().flatMap(lambda x: x).collect()
# Build one filtered DataFrame per distinct word
dfs = [df.where(df["Word"] == word) for word in words]

Post Spark 2.0, DataFrame no longer exposes flatMap directly, so convert to an RDD first:

words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
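For reference, here is a minimal, self-contained sketch (assuming Spark 2.0+ and a SparkSession named spark, which are not part of the original answer); the sample data and the split step mirror the question and the answer above, with a dict keyed by word used instead of a plain list for convenience:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-by-word").getOrCreate()

# Sample data from the question
df = spark.createDataFrame(
    [
        ("30/12/2015", "example_1", 3),
        ("29/12/2015", "example_2", 1),
        ("28/12/2015", "example_2", 9),
        ("27/12/2015", "example_3", 7),
    ],
    ["Timestamp", "Word", "Count"],
)

# Collect the distinct words, then build one filtered DataFrame per word
words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
dfs = {word: df.where(df["Word"] == word) for word in words}

for word, sub_df in dfs.items():
    print(word)
    sub_df.show()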

In addition to what zero323 said, I might add

df.persist()

before the creation of the dfs, so the source DataFrame won't need to be recomputed each time you run an action on each of your dfs.
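Concretely, the combined flow would look something like this (a sketch; the original answer wrote word.persist(), which reads as a typo for persisting the source DataFrame, and caching only pays off because each filtered DataFrame triggers at least one action, e.g. when plotting):

df.persist()  # cache the source DataFrame so each filtered child reuses it
words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
dfs = [df.where(df["Word"] == word) for word in words]
# ... run an action (show, collect, toPandas for plotting, ...) on each entry of dfs ...
df.unpersist()  # release the cached data when done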
