AWS GLUE ERROR : An error occurred while calling o75.pyWriteDynamicFrame. Cannot cast STRING into a IntegerType (value: BsonString{value=''})
I have a simple Glue PySpark job that connects to a MongoDB source through a Glue catalog table, extracts data from MongoDB collections using a Glue dynamic frame, and writes JSON output to S3. The MongoDB database here is deeply nested NoSQL with structs and arrays. Since it is a NoSQL db, the source schema is not fixed; nested columns may vary from document to document. However, the job fails with the error below.
Error: py4j.protocol.Py4JJavaError: An error occurred while calling o75.pyWriteDynamicFrame.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 6, 10.3.29.22, executor 1): com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value=''})
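For context, this failure happens inside the MongoDB Spark connector: the schema is inferred as IntegerType from a sample of documents, but some documents hold an empty string in that field, and an empty string cannot be cast to an integer. The same failure mode can be illustrated in plain Python (a hypothetical mini-example, not the connector's actual code):

```python
def cast_to_int(value):
    """Mimic a strict STRING -> Integer cast; raises on non-numeric input."""
    return int(value)

# A numeric string casts fine.
print(cast_to_int("42"))

# An empty string cannot be cast, which is exactly the shape of the
# BsonString{value=''} error above.
try:
    cast_to_int("")
except ValueError as e:
    print("cast failed:", e)
```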
Since the job fails because of this data-type mismatch, I have tried every solution I could think of:

- resolveChoice(): because the error concerns attributes with an 'int' data type, I tried casting all attributes of type 'int' to 'string'
- DropNullFields
- writing with a Spark dataframe instead of a dynamic frame
- applyMapping
- reading without the catalog table (from_options directly from the Mongo collection)
- with and without repartition

All of these attempts are commented in the code for reference.
Code snippet

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
print("Started")

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "<catalog_db_name>", table_name = "<catalog_table_name>", additional_options = {"database": "<mongo_database_name>", "collection": "<mongo_db_collection>"}, transformation_ctx = "datasource0")

# Code to read data directly from mongo database
# datasource0 = glueContext.create_dynamic_frame_from_options(connection_type = "mongodb", connection_options = { "uri": "<connection_string>", "database": "<mongo_db_name>", "collection": "<mongo_collection>", "username": "<db_username>", "password": "<db_password>"})

# Code sample for resolveChoice (converted all the 'int' datatypes to 'string')
# resolve_dyf = datasource0.resolveChoice(specs = [("nested.property", "cast:string"),("nested.further[].property", "cast:string")])

# Code sample to drop null fields
# dyf_dropNullfields = DropNullFields.apply(frame = resolve_dyf, transformation_ctx = "dyf_dropNullfields")

data_sink0 = datasource0.repartition(1)
print("Repartition done")

# Code sample to sink using spark's write method
# data_sink0.write.format("json").option("header","true").save("s3://<s3_folder_path>")

datasink1 = glueContext.write_dynamic_frame.from_options(frame = data_sink0, connection_type = "s3", connection_options = {"path": "s3://<S3_folder_path>"}, format = "json", transformation_ctx = "datasink1")
print("Data Sink complete")

job.commit()
```
Note

I am not sure why this happens, because the issue is intermittent. Sometimes the job works perfectly fine, but other times it fails, which makes it confusing.

Any help would be greatly appreciated.
I ran into the same issue. The simple fix is to increase the sample size from 1000 (MongoDB's default) to 100000. Adding a sample config for reference.
```python
read_config = {
    "uri": documentdb_write_uri,
    "database": "your_db",
    "collection": "your_collection",
    "username": "user",
    "password": "password",
    "partitioner": "MongoSamplePartitioner",
    "sampleSize": "100000",
    "partitionerOptions.partitionSizeMB": "1000",
    "partitionerOptions.partitionKey": "_id"
}
```
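To show where this config plugs in, here is a self-contained sketch of how the options would be passed to the reader. The Glue call itself is commented out and assumes the `glueContext.create_dynamic_frame_from_options` signature already used in the question; the URI value is a placeholder.

```python
# Placeholder connection string; substitute your real DocumentDB/MongoDB URI.
documentdb_write_uri = "mongodb://<host>:<port>"

# Raising sampleSize makes the MongoSamplePartitioner inspect many more
# documents when inferring the schema, so a field that is usually an int
# but occasionally an empty string is far less likely to be mis-typed.
read_config = {
    "uri": documentdb_write_uri,
    "database": "your_db",
    "collection": "your_collection",
    "username": "user",
    "password": "password",
    "partitioner": "MongoSamplePartitioner",
    "sampleSize": "100000",  # default is 1000
    "partitionerOptions.partitionSizeMB": "1000",
    "partitionerOptions.partitionKey": "_id",
}

# Inside the Glue job, the options are passed straight through, e.g.:
# datasource0 = glueContext.create_dynamic_frame_from_options(
#     connection_type="mongodb", connection_options=read_config)
```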