
How to change the schema of existing dataframe

Problem statement: I have a csv file with around 100+ fields. I need to perform transformations on these fields, generate 80+ new fields, and write only these new fields to s3 in parquet format.

The predefined parquet schema = 80+ newly populated fields + some non-populated fields.

Is there any way to pass this predefined parquet schema while writing the data to s3, so that the extra fields are also populated with null values?

A plain select of just the 80+ new fields will not be enough, because the predefined schema may have around 120 fields in total.

Below is the sample data and the transformation requirement.

CSV data

aid, productId, ts, orderId
1000,100,1674128580179,edf9929a-f253-487
1001,100,1674128580179,cc41a026-63df-410
1002,100,1674128580179,9732755b-1207-471
1003,100,1674128580179,51125ddd-4129-48a
1001,200,1674128580179,f4917676-b08d-41e
1004,200,1674128580179,dc80559d-16e6-4fa
1005,200,1674128580179,c9b743eb-457b-455
1006,100,1674128580179,e8611141-3e0e-4d5
1002,200,1674128580179,30be34c7-394c-43a

Parquet schema

// Predefined parquet fields described as maps of name/type/nullable/metadata
def getPartitionFieldsSchema() = {
  List(
    Map("name" -> "company", "type" -> "long",
      "nullable" -> true, "metadata" -> Map()),
    Map("name" -> "epoch_day", "type" -> "long",
      "nullable" -> true, "metadata" -> Map()),
    Map("name" -> "account", "type" -> "string",
      "nullable" -> true, "metadata" -> Map())
  )
}

val schemaMap = Map("type" -> "struct",
  "fields" -> getPartitionFieldsSchema)

Simple example

val dataDf = spark
  .read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("./scripts/input.csv")


dataDf
  .withColumn("company",lit(col("aid")/100))
  .withColumn("epoch_day",lit(col("ts")/86400))
  .write   // how to write only company, epoch_day, account ?
  .mode("append")
  .csv("/tmp/data2")

The output should have the following columns: company, epoch_day, account.

This is how I understand your problem: you want to read some csv files and transform them to parquet in s3. During the transformation, you need to create 3 new columns based on existing columns in the csv files. But since only 2 of the 3 new columns are calculated, the output only shows two new columns instead of 3.

In that case, you can create an external table in redshift and specify all the columns there. As a result, even if some columns are not fed, they will simply show up as null in your external table.
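
If you would rather handle this on the Spark side (which is what the question asks about), below is a rough sketch of one common pattern; it is not the external-table approach above, and it assumes the predefined schema is available as a StructType (for example the hypothetical partitionSchema shown earlier). The idea is to add every field that is missing from the DataFrame as a null literal cast to the expected type, then select the fields in schema order before writing parquet.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

// Sketch: align a DataFrame to a predefined schema, filling missing fields with nulls.
// targetSchema is assumed to hold the full predefined parquet schema (e.g. ~120 fields).
def alignToSchema(df: DataFrame, targetSchema: StructType): DataFrame = {
  val withMissing = targetSchema.fields.foldLeft(df) { (acc, field) =>
    if (acc.columns.contains(field.name)) acc
    else acc.withColumn(field.name, lit(null).cast(field.dataType)) // absent field -> null column
  }
  // Keep only the predefined fields, in schema order, cast to the expected types.
  withMissing.select(targetSchema.fields.map(f => col(f.name).cast(f.dataType)): _*)
}

// Hypothetical usage: write only the predefined columns to s3 as parquet.
// alignToSchema(transformedDf, partitionSchema)
//   .write
//   .mode("append")
//   .parquet("s3://bucket/path")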
