Multiple Spark DataFrame mutations in a single pipe

Consider a Spark DataFrame df with the following schema:

root 
|-- date: timestamp (nullable = true) 
|-- customerID: string (nullable = true) 
|-- orderID: string (nullable = true) 
|-- productID: string (nullable = true)

One column should be cast to a different type; the other columns should just have their whitespace trimmed.

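import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.IntegerType
// The $"..." column syntax assumes `import spark.implicits._` is in scope.

df.select(
  $"date",
  df("customerID").cast(IntegerType),
  $"orderID",
  $"productID")
  .withColumn("orderID", trim(col("orderID")))
  .withColumn("productID", trim(col("productID")))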

The operations seem to require different syntax: the cast is done via select, while the trim is done via withColumn. I'm used to R and dplyr, where all of the above would be handled in a single mutate function, so mixing select and withColumn feels a bit cumbersome.

Is there a cleaner way to do this in a single pipe?

df.select(
  $"date",
  $"customerID".cast(IntegerType),
  trim($"orderID").as("orderID"),
  trim($"productID").as("productID"))

You can use either one. The difference is that withColumn adds a new column to the DataFrame (or replaces the existing column if the name is already in use), while select keeps only the columns you specify. Choose whichever fits the situation.
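To make that difference concrete, here is a minimal sketch that applies the same trim through each method and prints the resulting column names. The object name, the local SparkSession settings, and the one-row sample data are assumptions made for illustration; they are not part of the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, trim}

object SelectVsWithColumn {
  // withColumn replaces orderID in place and keeps every other column.
  def viaWithColumn(df: DataFrame): DataFrame =
    df.withColumn("orderID", trim(col("orderID")))

  // select returns only the listed columns; everything else is dropped.
  def viaSelect(df: DataFrame): DataFrame =
    df.select(trim(col("orderID")).as("orderID"))

  def main(args: Array[String]): Unit = {
    // Assumed local session, for demonstration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("select-vs-withColumn")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample row with padded string values.
    val df = Seq(("42", " A1 ", " P9 ")).toDF("customerID", "orderID", "productID")

    println(viaWithColumn(df).columns.mkString(", ")) // customerID, orderID, productID
    println(viaSelect(df).columns.mkString(", "))     // orderID

    spark.stop()
  }
}
```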

The cast can be done using withColumn as follows:

df.withColumn("customerID", $"customerID".cast(IntegerType))
  .withColumn("orderID", trim($"orderID"))
  .withColumn("productID", trim($"productID"))

Note that you do not need to call withColumn on the date column above, because withColumn leaves all other columns in place.
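When the same function has to be applied to several columns, the repeated withColumn calls can be folded over a list of column names. The following is a sketch only; the object and method names (TrimColumns, trimAll) are hypothetical helpers, not part of Spark.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, trim}

object TrimColumns {
  // Start from df and apply one withColumn per name, each replacing the
  // column with its trimmed version. Columns not listed are left as they are.
  def trimAll(df: DataFrame, names: Seq[String]): DataFrame =
    names.foldLeft(df)((acc, name) => acc.withColumn(name, trim(col(name))))
}
```

For the schema in the question this would be called as TrimColumns.trimAll(df, Seq("orderID", "productID")). Each withColumn call adds a projection to the query plan, so for a long list of columns a single select is usually the safer choice.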


The trim calls can likewise be done in a select, as follows; here the column names are kept the same:

df.select(
  $"date",
  $"customerID".cast(IntegerType),
  trim($"orderID").as("orderID"),
  trim($"productID").as("productID"))
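For something closer to dplyr's mutate, the select can be generated from the DataFrame's own column list, so that only the changed columns have to be spelled out. This is a rough sketch under stated assumptions: Mutate, mutate and clean are hypothetical names, and unlike dplyr's mutate this helper only rewrites existing columns (keys in the map that do not match a column are ignored).

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.IntegerType

object Mutate {
  // Rebuild every column in its original order. Columns named in `changes`
  // get the replacement expression; all others pass through unchanged.
  def mutate(df: DataFrame, changes: Map[String, Column]): DataFrame =
    df.select(df.columns.map(name => changes.getOrElse(name, col(name)).as(name)): _*)

  // The question's cleanup expressed as one call; date needs no entry.
  def clean(df: DataFrame): DataFrame =
    mutate(df, Map(
      "customerID" -> col("customerID").cast(IntegerType),
      "orderID"    -> trim(col("orderID")),
      "productID"  -> trim(col("productID"))))
}
```

Because everything ends up in a single select, the column order and names of the input are preserved and only one projection is added to the plan.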
