
AWS Glue dropping mostly null fields

I have a DataFrame df with a couple of columns that are mostly null. I write it to an S3 bucket using the code below, then crawl the bucket to get the table schema into the Data Catalog. When I crawl the data, the fields that are mostly null get dropped. I've checked the JSON output and found that some records have the field and others don't. Does anyone know what the issue might be? I would like to include the fields even if they are mostly null.

Code:

# importing libraries

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# explicit names instead of repeated and wildcard imports
from pyspark.sql.functions import (
    col, first, lit, to_date, format_number, dayofmonth, hour,
    dayofyear, month, year, weekofyear, date_format, unix_timestamp,
)
# StringType belongs to pyspark.sql.types, not pyspark.sql.functions
from pyspark.sql.types import StringType

glueContext = GlueContext(SparkContext.getOrCreate())


# write to table
df.write.json('s3://path/table')
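One likely cause: Spark's JSON writer omits null-valued fields by default, which would explain why only some records carry the mostly-null columns. A minimal sketch of a possible fix follows, assuming the job runs on Spark 3.0 or later, where the JSON writer accepts the ignoreNullFields option. The helper name write_json_keep_nulls is hypothetical, made up for illustration.

```python
def write_json_keep_nulls(df, path):
    """Write `df` as JSON, emitting null fields explicitly.

    Assumption: Spark 3.0+; older versions do not have the
    `ignoreNullFields` writer option.
    """
    # With the default setting, a field whose value is null is left out
    # of the record, so mostly-null columns are missing from most lines.
    # Setting the option to "false" writes them as `"field": null`.
    df.write.option("ignoreNullFields", "false").json(path)
```

It would be called as write_json_keep_nulls(df, 's3://path/table') in place of the df.write.json line. Whether the crawler then infers a useful type for a column that is null in every sampled record is a separate question and not confirmed here.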

Why not use the AWS Glue write method instead of the Spark DataFrame writer?

glueContext.write_dynamic_frame.from_options
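A rough sketch of what that suggestion might look like inside a Glue job. The helper name write_with_glue is hypothetical, and the "s3" connection type, "path" option and "json" format are assumptions chosen to match the question. Whether this writer keeps null fields is not confirmed here, so the output should be checked.

```python
def write_with_glue(glue_context, dynamic_frame, path):
    """Write a DynamicFrame to S3 as JSON through the Glue writer.

    In a Glue job, `dynamic_frame` would typically be built from the
    Spark DataFrame with DynamicFrame.fromDF(df, glue_context, "df"),
    after `from awsglue.dynamicframe import DynamicFrame`.
    """
    return glue_context.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={"path": path},
        format="json",
    )
```

The function takes the GlueContext and the frame as arguments so it does not import awsglue itself, which is only available inside the Glue runtime.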
