
redshift glue job bigint issue

I have a Redshift database. In the database I have created a table, and the table has a bigint column. I created a Glue job to insert data into Redshift, but there is a problem with the bigint field: it is not being inserted. It seems to be some issue with bigint. The job code is below. I am using Python 3 and Spark 2.2.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test",
    table_name = "tbl_test", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("testdata", "string",
    "testdata", "string"), ("selling", "bigint", "selling", "bigint")],
    transformation_ctx = "applymapping1")

resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols",
    transformation_ctx = "resolvechoice2")

dropnullfields3 = DropNullFields.apply(frame = resolvechoice2,
    transformation_ctx = "dropnullfields3")

datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3,
    catalog_connection = "redshift_con", connection_options = {"dbtable": "tbl_test",
    "database": "test"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()

Try casting the types to "long" in your ApplyMapping call. If your Glue job is not failing on the write to Redshift, sometimes a new column will instead be created with the same name plus the Redshift datatype appended; in this case, selling_long.
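For reference, a minimal sketch of that change, reusing the frame and column names from the question (whether the source type should be "bigint", "int", or "long" depends on what the Crawler recorded in the Catalog, as discussed below):

# Cast the "selling" column to Spark's "long" type on the target side;
# names here are taken from the question above, adjust to your schema.
applymapping1 = ApplyMapping.apply(
    frame = datasource0,
    mappings = [
        ("testdata", "string", "testdata", "string"),
        ("selling", "long", "selling", "long")
    ],
    transformation_ctx = "applymapping1")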

The mappings from Spark types to Redshift types can be found in the JDBC driver:

|  Spark Type   |        JDBC Type         |
|---------------|--------------------------|
| IntegerType   | INTEGER                  |
| LongType      | BIGINT                   |
| DoubleType    | DOUBLE PRECISION         |
| FloatType     | REAL                     |
| ShortType     | INTEGER                  |
| ByteType      | SMALLINT                 |
| BooleanType   | BOOLEAN                  |
| StringType    | [VARCHAR|TEXT]           |
| TimestampType | TIMESTAMP                |
| DateType      | DATE                     |
| DecimalType   | DECIMAL(precision,scale) |

Try using the mapping: ("selling", "int", "selling", "long")

If this doesn't work, you should post what the "tbl_test" definition in the Glue Catalog looks like. The first type in your ApplyMapping should match the type listed in the Catalog's table definition.
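One way to check what the Catalog has recorded is to query it with boto3, which is available in the Glue job environment. A minimal sketch, assuming the database and table names from the question:

import boto3

# Print the column name/type pairs that the Glue Catalog holds for tbl_test;
# the first type in each ApplyMapping tuple should match these types.
glue = boto3.client("glue")
table = glue.get_table(DatabaseName="test", Name="tbl_test")
for col in table["Table"]["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])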

I had a similar issue. It turned out that the type on the Glue table created by the Glue Crawler in the console was 'int', not 'long', so for the Redshift type 'bigint' the ApplyMapping in the Glue job needed to be ("fieldName", "int", "fieldName", "long").

Interestingly, when I had the ApplyMapping as ("field", "long", "field", "long"), it allowed me to keep the value in the Glue DynamicFrame and even print it to the logs immediately before writing, but it would not write the data to Redshift.
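To reproduce that kind of check yourself, you can inspect the frame right before the write; printSchema and toDF are standard DynamicFrame methods. A minimal sketch, assuming the applymapping1 frame from the job above:

# Show the schema and a few rows just before the Redshift write; the data
# can look correct here and still fail to land in Redshift if the source
# type in the ApplyMapping does not match the Catalog's type.
applymapping1.printSchema()
applymapping1.toDF().show(5)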

Hope this helps!

We had the same issue: importing values with int and bigint resulted in only one of the two datatypes being delivered.
We worked around it with this solution:

1) Make sure that your source table in the Glue Crawler has "bigint" as its datatype.
2) Make sure that this line of code is in your Glue job:

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("testdata", "string", "testdata", "string"), ("selling", "long", "selling", "long")], transformation_ctx = "applymapping1")

3) After step 2 and all the steps up to dropnullfields3 (and this was our final solution), you have to cast to long again with the following line of code:

castedFrame = dropnullfields3.resolveChoice(specs = [('selling','cast:long')])
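To confirm the cast took effect before the load, you can print the schema of the casted frame (a quick check, not part of the original solution):

# "selling" should now be reported as long
castedFrame.printSchema()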

4) Now you can simply use this frame for your final load line:

datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = castedFrame,
    catalog_connection = "redshift_con", connection_options = {"dbtable": "tbl_test",
    "database": "test"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()

Hope that helps!
