
Remove all StructType columns from PySpark DataFrame

I have a DataFrame df created by reading a JSON file as follows:

df = spark.read.json("/myfiles/file1.json")

df.dtypes shows the following columns and data types:

 id – string
 Name – struct
 address – struct
 Phone – struct
 start_date – string
 years_with_company – int
 highest_education – string
 department – string
 reporting_hierarchy – struct

I want to extract only the non-struct columns and create a new DataFrame. For example, my resulting DataFrame should only have id, start_date, highest_education, and department.

Here is the code I have, which only partially works: only the last non-struct column, department, ends up populated. I want to collect all non-struct columns and then convert them to a DataFrame:

from pyspark.sql.types import StructType

names = df.schema.names

for col_name in names:
    if isinstance(df.schema[col_name].dataType, StructType):
        print("Skipping struct column %s" % col_name)
    else:
        df1 = df.select(col_name).collect()  # df1 is overwritten on each iteration

I'm pretty sure this is not the best way to do it, and I'm missing something that I cannot put my finger on, so I would appreciate your help. Thank you.

Use a list comprehension:

cols_filtered = [
    c for c in df.schema.names 
    if not isinstance(df.schema[c].dataType, StructType) 
]    

Or,

# Thank you @pault for the suggestion!
# Note: df.dtypes reports struct columns as 'struct<...>', so test the prefix
cols_filtered = [c for c, t in df.dtypes if not t.startswith('struct')]

Now you can pass the result to df.select:

df2 = df.select(*cols_filtered)
