
How to read a Parquet file, change the datatypes, and write to another Parquet file in Hadoop using PySpark

My source Parquet file has everything as string. My destination Parquet file needs these converted to different datatypes, such as int, string, date, etc. How do I do this?

You should read the file, typecast all the columns as required, and then save the result:

from pyspark.sql.functions import col

df = spark.read.parquet('/path/to/file')
df = df.select(col('col1').cast('int'), col('col2').cast('string'))
df.write.parquet('/target/path')
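The question also mentions dates. A plain .cast('date') generally only handles ISO-style yyyy-MM-dd strings, so for other formats use to_date with an explicit pattern. A minimal sketch, assuming a hypothetical string column order_date stored as dd/MM/yyyy (note that cast('int') silently turns non-numeric strings into null):

from pyspark.sql.functions import col, to_date

df = spark.read.parquet('/path/to/file')
df = df.select(
    col('col1').cast('int'),                                       # string -> int; bad values become null
    col('col2').cast('string'),                                    # stays a string
    to_date(col('order_date'), 'dd/MM/yyyy').alias('order_date'))  # string -> date with explicit format
df.write.parquet('/target/path')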

You may want to apply a user-defined schema to speed up data loading. There are two ways to apply it:

Using an input DDL-formatted string:

spark.read.schema("a INT, b STRING, c DOUBLE").parquet("test.parquet")

Using a StructType schema:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

customSchema = StructType([
        StructField("a", IntegerType(), True),
        StructField("b", StringType(), True),
        StructField("c", DoubleType(), True)])
spark.read.schema(customSchema).parquet("test.parquet")
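Either way, you can verify that the schema took effect with printSchema(); a quick check for the StructType version above:

df = spark.read.schema(customSchema).parquet("test.parquet")
df.printSchema()
# root
#  |-- a: integer (nullable = true)
#  |-- b: string (nullable = true)
#  |-- c: double (nullable = true)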

Data file:

| data_extract_id   | Alien_Dollardiff | Alien_Dollar |
| ab1def1gh-123-ea0 | 0                | 0            |

Script:

def createPrqtFParqt(datPath, parquetPath, inpustJsonSchema, outputdfSchema):
  print("## Parsing " + datPath)
  df = ssc.read.schema(outputdfSchema).parquet(datPath)  # ssc: assumed to be the SparkSession
  print("## Writing " + parquetPath)
  df.write.mode("overwrite").parquet(parquetPath)

Output: An error occurred while calling Parquet. Column: Alien_Dollardiff, Expected double, Found BINARY.

This error occurs because spark.read.schema() does not cast Parquet data on read: the supplied types must match the physical types stored in the file, and here every column was written as string (BINARY in Parquet), so declaring the column as double fails. Read the file with its existing string schema and cast the columns afterwards, as in the first answer.
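A minimal sketch of that fix, reusing the names from the script and data file above (ssc is assumed to be the SparkSession):

from pyspark.sql.functions import col

def createPrqtFParqt(datPath, parquetPath):
    print("## Parsing " + datPath)
    df = ssc.read.parquet(datPath)                       # let Spark use the file's own all-string schema
    df = df.select(
        col("data_extract_id"),                          # keep as string
        col("Alien_Dollardiff").cast("double"),          # string -> double
        col("Alien_Dollar").cast("double"))              # string -> double
    print("## Writing " + parquetPath)
    df.write.mode("overwrite").parquet(parquetPath)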
