
How to read a Parquet file, change the datatypes, and write to another Parquet file in Hadoop using PySpark

My source Parquet file has everything as string. My destination Parquet file needs these converted to different datatypes, such as int, string, date, etc. How do I do this?

You should read the file, typecast all the columns as required, and then save the result:

from pyspark.sql.functions import col

df = spark.read.parquet('/path/to/file')
df = df.select(col('col1').cast('int'), col('col2').cast('string'))
df.write.parquet('/target/path')
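The question also mentions dates. A plain .cast('date') generally only handles ISO-style yyyy-MM-dd strings, so for other formats use to_date with an explicit pattern. A minimal sketch, assuming a hypothetical string column order_date stored as dd/MM/yyyy (note that cast('int') silently turns non-numeric strings into null):

from pyspark.sql.functions import col, to_date

df = spark.read.parquet('/path/to/file')
df = df.select(
    col('col1').cast('int'),                                       # string -> int; bad values become null
    col('col2').cast('string'),                                    # stays a string
    to_date(col('order_date'), 'dd/MM/yyyy').alias('order_date'))  # string -> date with explicit format
df.write.parquet('/target/path')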

You may want to apply a user-defined schema to speed up data loading. There are two ways to apply it:

Using an input DDL-formatted string:

spark.read.schema("a INT, b STRING, c DOUBLE").parquet("test.parquet")

Using a StructType schema:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

customSchema = StructType([
        StructField("a", IntegerType(), True),
        StructField("b", StringType(), True),
        StructField("c", DoubleType(), True)])
spark.read.schema(customSchema).parquet("test.parquet")
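Either way, you can verify that the schema took effect with printSchema(); a quick check for the StructType version above:

df = spark.read.schema(customSchema).parquet("test.parquet")
df.printSchema()
# root
#  |-- a: integer (nullable = true)
#  |-- b: string (nullable = true)
#  |-- c: double (nullable = true)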

Data file:

| data_extract_id   | Alien_Dollardiff | Alien_Dollar |
| ab1def1gh-123-ea0 | 0                | 0            |

Script:

def createPrqtFParqt(datPath, parquetPath, inpustJsonSchema, outputdfSchema):
  print("## Parsing " + datPath)
  df = ssc.read.schema(outputdfSchema).parquet(datPath)  # ssc: assumed to be the SparkSession
  print("## Writing " + parquetPath)
  df.write.mode("overwrite").parquet(parquetPath)

Output: An error occurred while calling Parquet. Column: Alien_Dollardiff, Expected double, Found BINARY.

This error occurs because spark.read.schema() does not cast Parquet data on read: the supplied types must match the physical types stored in the file, and here every column was written as string (BINARY in Parquet), so declaring the column as double fails. Read the file with its existing string schema and cast the columns afterwards, as in the first answer.
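A minimal sketch of that fix, reusing the names from the script and data file above (ssc is assumed to be the SparkSession):

from pyspark.sql.functions import col

def createPrqtFParqt(datPath, parquetPath):
    print("## Parsing " + datPath)
    df = ssc.read.parquet(datPath)                       # let Spark use the file's own all-string schema
    df = df.select(
        col("data_extract_id"),                          # keep as string
        col("Alien_Dollardiff").cast("double"),          # string -> double
        col("Alien_Dollar").cast("double"))              # string -> double
    print("## Writing " + parquetPath)
    df.write.mode("overwrite").parquet(parquetPath)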
