
aws glue dropping mostly null fields

I have a dataframe df. It has a couple of columns that are mostly null. I'm writing it to an s3 bucket using the code below, and then crawling the s3 bucket to get the table schema in the Data Catalog. I'm finding that when I crawl the data, the fields that are mostly null get dropped. I've checked the JSON output and found that some records have the field and others don't. Does anyone know what the issue might be? I would like to include the fields even if they are mostly null.

Code:

# importing libraries

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

# pyspark helpers: the wildcard imports cover col, first, lit, date_format,
# to_date, unix_timestamp, etc. Note that StringType lives in
# pyspark.sql.types, not pyspark.sql.functions.
from pyspark.sql.functions import *
from pyspark.sql.types import *


# write to table
df.write.json('s3://path/table')
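
The behavior described above matches Spark's JSON writer, which by default omits keys whose value is null. A mostly-null column therefore only appears in the few records where it is populated, and the crawler can end up inferring a schema without it. A minimal sketch of that behavior, where the column name and S3 path are illustrative, and the ignoreNullFields option assumes a Spark 3.0+ runtime (Glue 3.0 or later):

spark = glueContext.spark_session  # SparkSession behind the GlueContext

# Illustrative data: "rare_field" is null in most records
sample = spark.createDataFrame([(1, "x"), (2, None)], ["id", "rare_field"])

# Default JSON output drops null keys per record:
#   {"id":1,"rare_field":"x"}
#   {"id":2}
sample.write.json('s3://path/table')

# On Spark 3.0+ (Glue 3.0+), null keys can be kept so every record
# carries the full set of fields:
sample.write.option("ignoreNullFields", "false").json('s3://path/table')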

Why not use the AWS Glue write method instead of the Spark DataFrame writer?

glueContext.write_dynamic_frame.from_options
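
A minimal sketch of that approach, assuming the df and glueContext from the question (the frame name "dyf" and the S3 path are illustrative):

from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame to a Glue DynamicFrame
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

# Write with Glue's writer instead of df.write.json
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://path/table"},
    format="json",
)

Whether the crawler then picks up the mostly-null fields still depends on which keys actually appear in the written JSON.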
