
Change the delimiter in AWS Glue Pyspark

from datetime import datetime
from pyspark.sql.functions import lit

abv_data = glueContext.create_dynamic_frame_from_options("s3",
    {'paths': ["s3://{}/{}".format(bucket, prefix)],
     "recurse": True, 'groupFiles': 'inPartition'},
    "csv", {'withHeader': True}, separator='\t')

abv_df_1 = abv_data.toDF()
abv_df_2 = abv_df_1.withColumn("save_date", lit(datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")))
conparms_r = glueContext.extract_jdbc_conf("reporting", catalog_id=None)

abv_df_2.write\
    .format("com.databricks.spark.redshift")\
    .option("url", "jdbc:redshift://rs_cluster:8192/rptg")\
    .option("dbtable", redshift_schema_table_output)\
    .option("user", conparms_r['user'])\
    .option("password", conparms_r['password'])\
    .option("aws_iam_role", "arn:aws:iam::123456789:role/redshift_admin_role")\
    .option("tempdir", args["TempDir"])\
    .option("extracopyoptions", "DELIMITER '\t' IGNOREHEADER 1 DATEFORMAT AS 'YYYY-MM-DD'")\
    .mode("append")\
    .save()

The csv has a tab delimiter on read, but when I add the column to the dataframe it uses a comma delimiter, which causes the Redshift load to fail.

Is there a way to add the column with a tab delimiter OR change the delimiter on the entire data frame?

This isn't necessarily the way to do this, but here is what I ended up doing:

Bring the csv in with a ',' separator:

glueContext.create_dynamic_frame_from_options("s3",
    {'paths': ["s3://{}/{}".format(bucket, prefix)],
     "recurse": True, 'groupFiles': 'inPartition'},
    "csv", {'withHeader': True}, separator=',')

Then split the first column on the tab character, add each split as its own column, and add the extra column at the same time.

Drop the first column because it is still the combined column.

This gives you a comma-separated df to load.
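Here is a minimal sketch of that sequence (not code from the original answer): it assumes the DynamicFrame abv_data was read with a ',' separator so the whole tab-separated record lands in the first column, and the output column names col_a, col_b, col_c are hypothetical placeholders for the real header.

from datetime import datetime
from pyspark.sql.functions import col, lit, split

df = abv_data.toDF()
combined = df.columns[0]            # the single column holding the full tab-separated line
parts = split(col(combined), "\t")  # split that column on the tab character

df_out = (df
    .withColumn("col_a", parts.getItem(0))   # hypothetical column names
    .withColumn("col_b", parts.getItem(1))
    .withColumn("col_c", parts.getItem(2))
    .withColumn("save_date", lit(datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")))
    .drop(combined))                         # drop the original combined column

df_out is then an ordinary DataFrame that can be written to Redshift as in the question.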

Use spark.read.option("delimiter", "\t").csv(file), or sep instead of delimiter.

For a special character, use a double backslash: spark.read.option("delimiter", "\\t").csv(file)
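For reference, a short usage sketch of that read (assuming an existing SparkSession named spark and a CSV path in file; the header option is only illustrative):

df = spark.read.option("header", True).option("delimiter", "\t").csv(file)
# or, equivalently, using the sep alias:
df = spark.read.option("sep", "\t").csv(file)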
