PySpark - Defining a custom schema for a DataFrame

Question

I am trying to read a CSV file and store it in a DataFrame, but when I try to create the ID column as StringType, it does not turn out the way I expect.

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, TimestampType)

table_schema = StructType([StructField('ID', StringType(), True),
                     StructField('Name', StringType(), True),
                     StructField('Tax_Percentage(%)', IntegerType(), False),
                     StructField('Effective_From', TimestampType(), False),
                     StructField('Effective_Upto', TimestampType(), True)])

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","


df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .option("schema", table_schema) \
  .load(file_location)



display(df)

Below is the schema produced after running the code above:

df:pyspark.sql.dataframe.DataFrame
ID:integer
Name:string
Tax_Percentage(%):integer
Effective_From:string
Effective_Upto :string

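Even though I supplied a custom schema, ID is typed as integer, whereas I expect it to be a string. The same goes for the Effective_From and Effective_Upto columns, which come back as string instead of timestamp.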

Answer 1

It should be

.schema(table_schema) \

instead of

.option("schema", table_schema) \

Also, if you provide a schema definition, you do not need .option("inferSchema", "true") :)

Answer 1: score 3, accepted, 2019-09-12 07:48:32
