
Best practice for PySpark DataFrame - to drop multiple columns?

Let's say one wants to drop a column from a DataFrame. Can that be done without creating a new DataFrame?

df = df.drop("tags_s")

It seems like creating a new DataFrame would be safer and more correct - is that right? What problems might one run into by reassigning the DataFrame variable as above?

If re-using a DataFrame variable is bad practice, let's say one wants to drop several columns that match a pattern:

for col in df.columns:
  if col.startswith("aux_"):
    df = df.drop(col)

Creating a new DataFrame on every iteration seems impractical in this case. What is the best practice?

If you are going to drop multiple columns, I'd say the first step is to identify the columns, save them in a list, and then do a single .drop, something like:

your_column_list = [col for col in df.columns if col.startswith("aux_")]
df = df.drop(*your_column_list)
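A minimal runnable sketch of this pattern. Plain Python stands in for the DataFrame's column list, since no Spark session is needed to demonstrate the filtering step; the column names and the `aux_` prefix here are hypothetical:

```python
# Hypothetical column names standing in for df.columns.
columns = ["id", "name", "aux_raw", "aux_score", "tags_s"]

# Collect every matching column in one pass...
to_drop = [col for col in columns if col.startswith("aux_")]
print(to_drop)  # ['aux_raw', 'aux_score']

# ...then, on a real pyspark.sql.DataFrame, a single call removes them all.
# DataFrame.drop accepts multiple column names, so the list is unpacked:
#   df = df.drop(*to_drop)
```

Since `drop` returns a new DataFrame rather than mutating the original, reassigning the result back to `df` is the usual idiom.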

This is according to my understanding of Spark DataFrames: you don't have to worry about it returning a new DataFrame each time, because what you're doing there is just a transformation on the DataFrame. Transformations are lazy and cheap - you can chain as many as you like, and no work is actually done until you run an action against the DataFrame.
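To illustrate the laziness described above, here is a small plain-Python analogy (not Spark itself; `drop_step` and `run` are made-up names): each "transformation" merely records a step and returns a new plan, and nothing executes until an "action" runs the plan - which is why chaining many drop calls is inexpensive.

```python
executed = []  # records which steps have actually run

def drop_step(plan, column):
    # Like DataFrame.drop: returns a NEW plan, does no work yet.
    return plan + [f"drop {column}"]

plan = []
plan = drop_step(plan, "aux_raw")
plan = drop_step(plan, "aux_tmp")

# Nothing has executed yet - "transformations" are lazy.
assert executed == []

def run(plan):
    # Like an action (count/collect): executes the accumulated steps.
    for step in plan:
        executed.append(step)
    return len(executed)

run(plan)
print(executed)  # ['drop aux_raw', 'drop aux_tmp']
```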

Here is more information on transformations vs actions: https://spark.apache.org/docs/latest/rdd-programming-guide.html#basics

