
How can I change the schema in PySpark AWS Glue?

I have a problem with a schema in PySpark. I am trying to read a complete collection from MongoDB with a Glue dynamic frame.

    read_mongo_options = {
        "uri": mongo_uri,
        "database": "database",
        "collection": collection,
        "ssl": "true",
        "username": "user",
        "password": "password",
        "sampleSize": "200000",
        "partitioner": "MongoSamplePartitioner",
        "partitionerOptions.partitionSizeMB": "10",
        "partitionerOptions.partitionKey": "_id"
    }

    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="mongodb",
        connection_options=read_mongo_options,
        transformation_ctx="datasource0"
    )
    df = datasource0.toDF()

I think it is not able to interpret a schema that has different value types in the same field. This is the value of defaultPaymentsource in one example document:

  "defaultPaymentsource": {
    "$oid": "58f0933bXXXXXXXXXX"
  },

In other documents, this field does not exist at all. After inferring the schema, this is the result:

[screenshot of the inferred schema, showing an "undefined" member under defaultPaymentsource]
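
To narrow down where the conversion breaks, it can help to compare the schema Glue inferred for the DynamicFrame with the Spark schema it is converted to. A minimal sketch, reusing datasource0 from above:

    # Compare the Glue-inferred schema with the Spark schema after toDF().
    datasource0.printSchema()         # DynamicFrame schema; may show the "undefined" member
    datasource0.toDF().printSchema()  # Spark schema that DataFrame operations will use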

When I try to save or show the results, it gives me the following error:

Cannot cast UNDEFINED into a StructType(StructField(oid,StringType,true))

I have set sampleSize to the maximum and I have changed the schema by removing the reference to 'undefined', but the result is always the same. Any ideas?
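
For reference, "removing the reference to 'undefined'" amounts to rebuilding the struct without that member. A minimal sketch of such a schema, with every other field of the collection omitted (the field names mirror the example above):

    from pyspark.sql.types import StructType, StructField, StringType

    # Sketch only: defaultPaymentsource keeps just the "oid" member, no "undefined".
    payment_source_type = StructType([StructField("oid", StringType(), True)])
    schema = StructType([StructField("defaultPaymentsource", payment_source_type, True)])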

Well, it works now.

Dropping the column and creating a new df:

    from awsglue.transforms import DropFields

    # Drop the problematic nested field, then reuse the resulting Spark schema for the re-read.
    df_no_p = DropFields.apply(frame=datasource0, paths=['defaultPaymentsource.undefined'])
    schema2 = df_no_p.toDF().schema
    df = spark.read.format("mongo").load(schema=schema2)
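
For completeness, the re-read usually also needs the MongoDB connection options. A minimal sketch, assuming the "mongo" source of the MongoDB Spark connector and the same mongo_uri and collection variables as in the question:

    df = (spark.read.format("mongo")
          .option("uri", mongo_uri)
          .option("database", "database")
          .option("collection", collection)
          .schema(schema2)   # schema with defaultPaymentsource.undefined dropped
          .load())
    df.show(5)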
