
AWS GLUE ERROR : An error occurred while calling o75.pyWriteDynamicFrame. Cannot cast STRING into a IntegerType (value: BsonString{value=''})

I have a simple Glue PySpark job that connects to a MongoDB source through a Glue catalog table, pulls data from the MongoDB collections using a Glue dynamic frame, and writes JSON output to S3. The MongoDB database here is deeply nested NoSQL with structs and arrays. Since it is a NoSQL DB, the source schema is not fixed, and nested columns can vary from document to document. However, the job fails with the error below.

Error: py4j.protocol.Py4JJavaError: An error occurred while calling o75.pyWriteDynamicFrame.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 6, 10.3.29.22, executor 1): com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value=''})

Since the job fails because of a datatype mismatch, I have tried every solution I could, such as using resolveChoice(). Because the error points at attributes with the 'int' datatype, I tried casting all attributes of type 'int' to 'string'. I also tried dropnullfields, writing with a Spark dataframe, applyMapping, reading without the catalog table (from_options directly from the Mongo table), and running with and without repartition. All of these attempts are commented in the code for reference.

Code snippet

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

print("Started")
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "<catalog_db_name>", table_name = "<catalog_table_name>", additional_options = {"database": "<mongo_database_name>", "collection": "<mongo_db_collection>"}, transformation_ctx = "datasource0")

# Code to read data directly from mongo database
# datasource0 = glueContext.create_dynamic_frame_from_options(connection_type = "mongodb", connection_options = { "uri": "<connection_string>", "database": "<mongo_db_name>", "collection": "<mongo_collection>", "username": "<db_username>", "password": "<db_password>"})

# Code sample for resolveChoice (converted all the 'int' datatypes to 'string')
# resolve_dyf = datasource0.resolveChoice(specs = [("nested.property", "cast:string"),("nested.further[].property", "cast:string")])

# Code sample to dropnullfields
# dyf_dropNullfields = DropNullFields.apply(frame = resolve_dyf, transformation_ctx = "dyf_dropNullfields")

data_sink0 = datasource0.repartition(1)
print("Repartition done")

# Code sample to sink using spark's write method
# data_sink0.write.format("json").option("header","true").save("s3://<s3_folder_path>")

datasink1 = glueContext.write_dynamic_frame.from_options(frame = data_sink0, connection_type = "s3", connection_options = {"path": "s3://<S3_folder_path>"}, format = "json", transformation_ctx = "datasink1")
print("Data Sink complete")
job.commit()
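
For illustration, a minimal sketch of a frame-wide form of the resolveChoice attempt mentioned above, assuming the default-action variant of DynamicFrame.resolveChoice (a choice argument applied to every ambiguous column instead of per-field specs); the sink options are the same placeholders as in the snippet:

# Assumption: apply one default resolution action to every ChoiceType column,
# so any field sampled as int but containing strings (e.g. '') is cast to
# string before the write, instead of listing each nested field in specs.
resolve_all_dyf = datasource0.resolveChoice(choice = "cast:string")
datasink_resolved = glueContext.write_dynamic_frame.from_options(frame = resolve_all_dyf, connection_type = "s3", connection_options = {"path": "s3://<S3_folder_path>"}, format = "json", transformation_ctx = "datasink_resolved")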

Note

I am not sure why this happens, because the issue is intermittent. Sometimes the job runs fine and sometimes it fails, which makes it confusing.

Any help would be greatly appreciated.

I faced the same problem. The simple solution is to increase the sample size from 1000 (MongoDB's default) to 100000. A sample config is added below for reference.

read_config = {
    "uri": documentdb_write_uri,
    "database": "your_db",
    "collection": "your_collection",
    "username": "user",
    "password": "password",
    "partitioner": "MongoSamplePartitioner",
    "sampleSize": "100000",
    "partitionerOptions.partitionSizeMB": "1000",
    "partitionerOptions.partitionKey": "_id"
}
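
A minimal sketch of how this config might be wired into the job above, assuming the from_options reader shown in the question (connection_type may be "mongodb" or "documentdb" depending on the source, and the S3 path is a placeholder):

# Read with the larger sampleSize so schema inference sees more documents
# before deciding on a type, which avoids the int-vs-empty-string clash on write.
read_dyf = glueContext.create_dynamic_frame_from_options(connection_type = "mongodb", connection_options = read_config, transformation_ctx = "read_dyf")
datasink_sampled = glueContext.write_dynamic_frame.from_options(frame = read_dyf, connection_type = "s3", connection_options = {"path": "s3://<s3_folder_path>"}, format = "json", transformation_ctx = "datasink_sampled")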

