
How can I change the schema in PySpark AWS Glue?

I have a problem with a schema in PySpark. I am trying to read a complete collection from MongoDB with a Glue dynamic frame.

    read_mongo_options = {
        "uri": mongo_uri,
        "database": "database",
        "collection": collection,
        "ssl": "true",
        "username": "user",
        "password": "password",
        "sampleSize": "200000",
        "partitioner": "MongoSamplePartitioner",
        "partitionerOptions.partitionSizeMB": "10",
        "partitionerOptions.partitionKey": "_id"
    }

    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="mongodb",
        connection_options=read_mongo_options,
        transformation_ctx="datasource0"
    )
    df = datasource0.toDF()

I think it is not able to interpret a schema that has different value types in the same field. This is the value of defaultPaymentsource in one example document:

  "defaultPaymentsource": {
    "$oid": "58f0933bXXXXXXXXXX"
  },

In other documents, this field does not exist at all. After inferring the schema, this is the result:

[screenshot of the inferred schema, showing an "undefined" member under defaultPaymentsource]
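
To narrow down where the conversion breaks, it can help to compare the schema Glue inferred for the DynamicFrame with the Spark schema it is converted to. A minimal sketch, reusing datasource0 from above:

    # Compare the Glue-inferred schema with the Spark schema after toDF().
    datasource0.printSchema()         # DynamicFrame schema; may show the "undefined" member
    datasource0.toDF().printSchema()  # Spark schema that DataFrame operations will use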

When I try to save or show the results, it gives me the following error:

Cannot cast UNDEFINED into a StructType(StructField(oid,StringType,true))

I have set sampleSize to the maximum and I have changed the schema by removing the reference to 'undefined', but the result is always the same. Any ideas?
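
For reference, "removing the reference to 'undefined'" amounts to rebuilding the struct without that member. A minimal sketch of such a schema, with every other field of the collection omitted (the field names mirror the example above):

    from pyspark.sql.types import StructType, StructField, StringType

    # Sketch only: defaultPaymentsource keeps just the "oid" member, no "undefined".
    payment_source_type = StructType([StructField("oid", StringType(), True)])
    schema = StructType([StructField("defaultPaymentsource", payment_source_type, True)])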

Well, it works now.

Dropping the column and creating a new df:

    from awsglue.transforms import DropFields

    # Drop the problematic nested field, then reuse the resulting Spark schema for the re-read.
    df_no_p = DropFields.apply(frame=datasource0, paths=['defaultPaymentsource.undefined'])
    schema2 = df_no_p.toDF().schema
    df = spark.read.format("mongo").load(schema=schema2)
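
For completeness, the re-read usually also needs the MongoDB connection options. A minimal sketch, assuming the "mongo" source of the MongoDB Spark connector and the same mongo_uri and collection variables as in the question:

    df = (spark.read.format("mongo")
          .option("uri", mongo_uri)
          .option("database", "database")
          .option("collection", collection)
          .schema(schema2)   # schema with defaultPaymentsource.undefined dropped
          .load())
    df.show(5)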
