
Remove all StructType columns from PySpark DataFrame

I have a DataFrame df created by reading a JSON file as follows:

df = spark.read.json("/myfiles/file1.json")

df.dtypes shows the following columns and data types:

 id – string
 Name – struct
 address – struct
 Phone – struct
 start_date – string
 years_with_company – int
 highest_education – string
 department – string
 reporting_hierarchy – struct

I want to extract only the non-struct columns and create a new DataFrame. For example, my resulting DataFrame should only have id, start_date, highest_education, and department.

Here is the code I have, which only partially works: only the last non-struct column, department, ends up populated. I want to collect all non-struct columns and then convert them to a DataFrame:

from pyspark.sql.types import StructType

names = df.schema.names

for col_name in names:
    if isinstance(df.schema[col_name].dataType, StructType):
        print("Skipping struct column %s" % col_name)
    else:
        df1 = df.select(col_name).collect()  # df1 is overwritten on each iteration

I'm pretty sure this is not the best way to do it, and I'm missing something that I cannot put my finger on, so I would appreciate your help. Thank you.

Use a list comprehension:

cols_filtered = [
    c for c in df.schema.names 
    if not isinstance(df.schema[c].dataType, StructType) 
]    

Or,

# Thank you @pault for the suggestion!
# Note: df.dtypes reports struct columns as 'struct<...>', so test the prefix
cols_filtered = [c for c, t in df.dtypes if not t.startswith('struct')]

Now you can pass the result to df.select:

df2 = df.select(*cols_filtered)
