How to set the right Data Type in parquet with Spark from a CSV with pyspark

I have a CSV file that looks like this:

39813458,13451345,14513,SomeText,344564,Some other text,328984,"[{""field_int_one"":""16784832510"",""second_int_field"":""84017"",""third_int_field"":""245"",""some_timestamp_one"":""2018-04-17T23:54:34.000Z"",""some_other_timestamp"":""2018-03-03T15:34:04.000Z"",""one_more_int_field"":0,},{""field_int_one"":""18447548326"",""second_int_field"":""04965"",""third_int_field"":""679"",""some_timestamp_one"":""2018-02-06T03:39:12.000Z"",""some_other_timestamp"":""2018-03-01T09:19:12.000Z"",""one_more_int_field"":0}]"

I convert it to Parquet with:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

if __name__ == "__main__":
    # SparkSession is the entry point; no separate SparkContext or SQLContext is needed.
    spark = SparkSession.builder.master("local").getOrCreate()

    schema = StructType([
              StructField("first_int", IntegerType(), True),
              StructField("second_int", IntegerType(), True),
              StructField("third_int", IntegerType(), True),
              StructField("first_string_field", StringType(), True),
              StructField("fourth_int", IntegerType(), True),
              StructField("second_string_field", StringType(), True),
              StructField("last_int_field", StringType(), True),
              StructField("json_field", StringType(), True)])

    # spark.read returns a DataFrame (not an RDD), so name it accordingly.
    df = spark.read.schema(schema).csv("source_file.csv")
    df.write.parquet('parquet_output')

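This works and the conversion succeeds, but if I call .printSchema() after querying the result, the last field is, as expected from the schema, printed as a String. How can I correctly declare the last field as JSON?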

I think a nested ArrayType would work for this kind of schema:

from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

schema = StructType([
          StructField("first_int", IntegerType(), True),
          StructField("second_int", IntegerType(), True),
          StructField("third_int", IntegerType(), True),
          StructField("first_string_field", StringType(), True),
          StructField("fourth_int", IntegerType(), True),
          StructField("second_string_field", StringType(), True),
          StructField("last_int_field", StringType(), True),
          # json_field is an array of structs; each struct is one JSON object.
          StructField("json_field", ArrayType(
                StructType()
                   .add("field_int_one", IntegerType())
                   .add("field_string_one", StringType())
                   # ... add the remaining fields here
          ), True)])
