I have a data frame df created by reading a JSON file:
df = spark.read.json("/myfiles/file1.json")
df.dtypes
shows the following columns and data types:
- id – string
- Name – struct
- address – struct
- Phone – struct
- start_date – string
- years_with_company – int
- highest_education – string
- department – string
- reporting_hierarchy – struct
I want to extract only the non-struct columns into a new data frame. For example, my resulting data frame should only have id, start_date, highest_education, and department.
Here is the code I have, which only partially works: df1 ends up populated with values from just the last non-struct column, department. I want all the non-struct columns collected together and converted to a data frame:
from pyspark.sql.types import StructType

names = df.schema.names
for col_name in names:
    if isinstance(df.schema[col_name].dataType, StructType):
        print("Skipping struct column %s" % col_name)
    else:
        # Bug: this overwrites df1 on every iteration, so only the
        # last non-struct column survives the loop.
        df1 = df.select(col_name).collect()
I'm pretty sure this is not the best way to do it, and I'm missing something I can't put my finger on, so I would appreciate your help. Thank you.
Use a list comprehension:
from pyspark.sql.types import StructType

cols_filtered = [
    c for c in df.schema.names
    if not isinstance(df.schema[c].dataType, StructType)
]
Or,
# Thank you @pault for the suggestion!
# Note: df.dtypes renders struct columns as "struct<...>", so test the
# prefix rather than comparing for equality with 'struct':
cols_filtered = [c for c, t in df.dtypes if not t.startswith('struct')]
Now you can unpack the result into df.select:
df2 = df.select(*cols_filtered)
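To see why the prefix check matters, here is a minimal pure-Python sketch (no Spark session needed) using (name, type-string) pairs shaped like what df.dtypes returns for the question's schema. The struct field details are hypothetical, invented only for illustration:

```python
# Sample df.dtypes output: struct columns render as "struct<field:type,...>",
# never as the bare string "struct", so an equality check matches nothing.
dtypes = [
    ("id", "string"),
    ("Name", "struct<first:string,last:string>"),          # fields assumed
    ("address", "struct<city:string,zip:string>"),         # fields assumed
    ("Phone", "struct<home:string,cell:string>"),          # fields assumed
    ("start_date", "string"),
    ("years_with_company", "int"),
    ("highest_education", "string"),
    ("department", "string"),
    ("reporting_hierarchy", "struct<manager:string>"),     # fields assumed
]

# Equality check: every type string differs from 'struct', so nothing is filtered.
wrong = [c for c, t in dtypes if t != "struct"]

# Prefix check: correctly drops the four struct columns.
right = [c for c, t in dtypes if not t.startswith("struct")]

print(wrong)  # all 9 columns, including the structs
print(right)  # ['id', 'start_date', 'years_with_company', 'highest_education', 'department']
```

Note that years_with_company is an int and therefore also survives the filter; if you truly want only the four columns named in the question, list them explicitly instead.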