
Is there a way to create multiple columns from existing columns of a dataframe in Scala?

I am trying to ingest an RDBMS table into Hive. I have obtained the dataframe in the following way:

val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", "(select * from schema.tablename where source_system_name='DB2' and period_year='2017') as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("numPartitions", 15)
  .load()
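One caveat worth flagging about the read above (an editorial note, not from the original post): with the JDBC source, `numPartitions` only parallelizes the read when a partition column and bounds are also supplied; without them the whole query is fetched in a single partition. A sketch of a partitioned variant — the partition column and bounds below are illustrative assumptions, not values from the question:

```scala
// Sketch: parallel JDBC read. "xx_last_update_log_id" is assumed to be a
// numeric column suitable for range partitioning; the bounds are made up
// and should be replaced with real min/max values for the table.
val yearDFPartitioned = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", "(select * from schema.tablename where source_system_name='DB2' and period_year='2017') as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "xx_last_update_log_id") // assumed numeric column
  .option("lowerBound", "0")          // illustrative
  .option("upperBound", "1000000")    // illustrative
  .option("numPartitions", 15)
  .load()
```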

These are the columns of the dataframe:

geography:string
project:string
reference_code:string
product_line:string
book_type:string
cc_region:string
cc_channel:string
cc_function:string
pl_market:string
ptd_balance:double
qtd_balance:double
ytd_balance:double
xx_last_update_tms:timestamp
xx_last_update_log_id:int
xx_data_hash_code:string
xx_data_hash_id:bigint

The columns ptd_balance, qtd_balance, ytd_balance are double datatypes, i.e. precision columns. Our project wants to convert their datatype from Double to String by creating new columns ptd_balance_text, qtd_balance_text, ytd_balance_text with the same data, in order to avoid any data truncation.

withColumn will create a new column in the dataframe. withColumnRenamed will rename an existing column.

The dataframe has nearly 10 million records. Is there an effective way to create multiple new columns with the same data and a different type from the existing columns in a dataframe?
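For reference, the withColumn route mentioned in the question can be sketched as a fold over the list of balance columns (a sketch, assuming the `yearDF` and column names from the question above):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// For each balance column, add a <name>_text copy cast to string.
val balanceCols = Seq("ptd_balance", "qtd_balance", "ytd_balance")

val yearWithText: DataFrame = balanceCols.foldLeft(yearDF) { (df, c) =>
  df.withColumn(c + "_text", col(c).cast(StringType))
}
```

Note that chaining many withColumn calls grows the query plan; for three columns this is fine, but a single select projection scales better for wide dataframes.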

You can do this by building the query from all the columns, as below:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

//Input: 

scala> df.show
+----+-----+--------+--------+
|  id| name|  salary|   bonus|
+----+-----+--------+--------+
|1001|Alice| 8000.25|1233.385|
|1002|  Bob|7526.365| 1856.69|
+----+-----+--------+--------+


scala> df.printSchema
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- salary: double (nullable = false)
 |-- bonus: double (nullable = false)

//solution approach:
val query = df.columns.toList.map { cl =>
  if (cl == "salary" || cl == "bonus") col(cl).cast(StringType).as(cl + "_text")
  else col(cl)
}

//Output: 

scala> df.select(query:_*).printSchema
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- salary_text: string (nullable = false)
 |-- bonus_text: string (nullable = false)


scala> df.select(query:_*).show
+----+-----+-----------+----------+
|  id| name|salary_text|bonus_text|
+----+-----+-----------+----------+
|1001|Alice|    8000.25|  1233.385|
|1002|  Bob|   7526.365|   1856.69|
+----+-----+-----------+----------+
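Note that the projection above replaces salary and bonus with their _text versions. If the originals should be kept as well, which is what the question asks for, the column list can include both (a sketch against the same sample df):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// Keep every original column AND append string copies of salary/bonus.
val keepAndAdd = df.columns.map(col) ++
  Seq("salary", "bonus").map(c => col(c).cast(StringType).as(c + "_text"))

val result = df.select(keepAndAdd: _*)
// expected schema: id, name, salary, bonus, salary_text, bonus_text
```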

If I were in your shoes, I would make the changes in the extraction query itself, or ask the BI team to put in some effort :P to add and cast the fields on the fly while extracting. But in any case, what you are asking is possible.
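For example, the "cast on the fly while extracting" idea could be done in the pushdown query passed to the JDBC reader (a sketch; the VARCHAR width and exact DB2 cast syntax are assumptions to verify against your schema):

```scala
// Sketch: let the database produce the *_text columns, so the dataframe
// arrives with them already present. Width 40 is an illustrative choice.
val castQuery =
  """(select t.*,
    |        cast(ptd_balance as varchar(40)) as ptd_balance_text,
    |        cast(qtd_balance as varchar(40)) as qtd_balance_text,
    |        cast(ytd_balance as varchar(40)) as ytd_balance_text
    |   from schema.tablename t
    |  where source_system_name='DB2' and period_year='2017') as year2017""".stripMargin

// then pass castQuery as the "dbtable" option of the JDBC reader shown above
```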

You can add the columns from the existing columns as below. Check the addColsTosampleDF dataframe. I hope the comments below will be enough to understand; if you have any questions, feel free to add them in the comments and I will edit my answer.

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

scala> val ss = SparkSession.builder().appName("TEST").getOrCreate()
18/08/07 15:51:42 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
ss: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6de4071b

//Sample dataframe with int, double and string fields
scala> val sampleDf = Seq((100, 1.0, "row1"),(1,10.12,"col_float")).toDF("col1", "col2", "col3")
sampleDf: org.apache.spark.sql.DataFrame = [col1: int, col2: double ... 1 more field]

scala> sampleDf.printSchema
root
 |-- col1: integer (nullable = false)
 |-- col2: double (nullable = false)
 |-- col3: string (nullable = true)

//Adding columns col1_string from col1 and col2_doubletostring from col2 with casting and alias
scala> val addColsTosampleDF = sampleDf.
select(sampleDf.col("col1"),
sampleDf.col("col2"),
sampleDf.col("col3"),
sampleDf.col("col1").cast("string").alias("col1_string"),
sampleDf.col("col2").cast("string").alias("col2_doubletostring"))
addColsTosampleDF: org.apache.spark.sql.DataFrame = [col1: int, col2: double ... 3 more fields]

//Schema with added columns
scala> addColsTosampleDF.printSchema
root
 |-- col1: integer (nullable = false)
 |-- col2: double (nullable = false)
 |-- col3: string (nullable = true)
 |-- col1_string: string (nullable = false)
 |-- col2_doubletostring: string (nullable = false)

scala> addColsTosampleDF.show()
+----+-----+---------+-----------+-------------------+
|col1| col2|     col3|col1_string|col2_doubletostring|
+----+-----+---------+-----------+-------------------+
| 100|  1.0|     row1|        100|                1.0|
|   1|10.12|col_float|          1|              10.12|
+----+-----+---------+-----------+-------------------+
