
Best practice for PySpark DataFrame - to drop multiple columns?

Let's say one wants to drop a column from a DataFrame. Can that be done without creating a new DataFrame?

df = df.drop("tags_s")

It seems like creating a new DataFrame would be safer and more correct - is that right? What problems might one run into by reassigning the DataFrame variable as above?

If re-using a DataFrame variable is bad practice, let's say one wants to drop several columns that match a pattern:

for col in df.columns:
  if col.startswith("aux_"):
    df = df.drop(col)

Creating a new DataFrame on every iteration seems impractical in this case. What is the best practice?

If you are going to drop multiple columns, I'd say the first step is to identify the columns, save them in a list, and then do a single .drop, something like:

your_column_list = [col for col in df.columns if col.startswith("aux_")]
df = df.drop(*your_column_list)
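A minimal runnable sketch of this pattern. Plain Python stands in for the DataFrame's column list, since no Spark session is needed to demonstrate the filtering step; the column names and the `aux_` prefix here are hypothetical:

```python
# Hypothetical column names standing in for df.columns.
columns = ["id", "name", "aux_raw", "aux_score", "tags_s"]

# Collect every matching column in one pass...
to_drop = [col for col in columns if col.startswith("aux_")]
print(to_drop)  # ['aux_raw', 'aux_score']

# ...then, on a real pyspark.sql.DataFrame, a single call removes them all.
# DataFrame.drop accepts multiple column names, so the list is unpacked:
#   df = df.drop(*to_drop)
```

Since `drop` returns a new DataFrame rather than mutating the original, reassigning the result back to `df` is the usual idiom.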

This is according to my understanding of Spark DataFrames: you don't have to worry about it returning a new DataFrame each time, because what you're doing there is just a transformation on the DataFrame. Transformations are lazy and cheap - you can chain as many as you like, and no work is actually done until you run an action against the DataFrame.
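To illustrate the laziness described above, here is a small plain-Python analogy (not Spark itself; `drop_step` and `run` are made-up names): each "transformation" merely records a step and returns a new plan, and nothing executes until an "action" runs the plan - which is why chaining many drop calls is inexpensive.

```python
executed = []  # records which steps have actually run

def drop_step(plan, column):
    # Like DataFrame.drop: returns a NEW plan, does no work yet.
    return plan + [f"drop {column}"]

plan = []
plan = drop_step(plan, "aux_raw")
plan = drop_step(plan, "aux_tmp")

# Nothing has executed yet - "transformations" are lazy.
assert executed == []

def run(plan):
    # Like an action (count/collect): executes the accumulated steps.
    for step in plan:
        executed.append(step)
    return len(executed)

run(plan)
print(executed)  # ['drop aux_raw', 'drop aux_tmp']
```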

Here is more information on transformations vs actions: https://spark.apache.org/docs/latest/rdd-programming-guide.html#basics

