How do I convert Python objects whose fields are instances of other classes into a DataFrame? I tried the code below, but it does not work.
I can get it to work when I take out self.address = Address() and self.agency_contact_info = ContactInfo().
class Address:
    def __init__(self):
        self.address_one = "address 1"
        self.address_two = "P.O. BOX 1"

class ContactInfo:
    def __init__(self):
        self.person_name = "Me"
        self.phone_number = "999-999-9999"

class AgencyRecord:
    def __init__(self):
        self.agency_code = "00"
        self.agency_id = "000"
        self.agency_name = "Some Agency"
        self.address = Address()
        self.agency_contact_info = ContactInfo()

def create_data():
    data = {}
    for i in range(0, 3):
        alc = AgencyRecord()
        data[i] = alc

    column_list = [
        'agency_code', 'agency_id', 'agency_name',
        'address_one', 'address_two', 'person_name', 'phone_number'
    ]

    spark.createDataFrame(
        list(data.values()),
        column_list
    ).createOrReplaceTempView("MyTempTable")
Quoting myself again:
I find it's useful to think of the argument to createDataFrame() as a list of [iterables] where each entry in the list corresponds to a row in the DataFrame and each element of the [iterable] corresponds to a column.
So you need to convert each of your objects into an iterable whose elements correspond, in order, to the columns in column_list.
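To make the quoted rule concrete, here is a minimal sketch in plain Python (no Spark session required) of what "a list of iterables" looks like for the address and contact fields. The helper name to_row is hypothetical and only for illustration.

```python
class Address:
    def __init__(self):
        self.address_one = "address 1"
        self.address_two = "P.O. BOX 1"


class ContactInfo:
    def __init__(self):
        self.person_name = "Me"
        self.phone_number = "999-999-9999"


def to_row(address, contact):
    """Hypothetical helper: build one row as a tuple with one value per column."""
    return (
        address.address_one,
        address.address_two,
        contact.person_name,
        contact.phone_number,
    )


columns = ['address_one', 'address_two', 'person_name', 'phone_number']

# One tuple per row; the position of each value matches the position
# of its column name in `columns`.
rows = [to_row(Address(), ContactInfo()) for _ in range(3)]

# With a Spark session available, this is the shape that
# spark.createDataFrame(rows, columns) expects.
```

The original attempt fails because each AgencyRecord instance is a single object rather than a sequence of column values, so there is nothing for the column names to line up against.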
I wouldn't necessarily endorse it (there's almost surely a better way), but here is one hacky approach you can take to modify your code accordingly:
You can take advantage of the fact that Python objects have a __dict__ attribute that you can use to retrieve attributes by name. First, update your AgencyRecord class to pull in the fields from the Address and ContactInfo classes:
class AgencyRecord:
    def __init__(self):
        self.agency_code = "00"
        self.agency_id = "000"
        self.agency_name = "Some Agency"
        self.address = Address()
        self.agency_contact_info = ContactInfo()

        # makes the variables of the contained classes members of this class
        self.__dict__.update(self.address.__dict__)
        self.__dict__.update(self.agency_contact_info.__dict__)
Now we can reference each column in column_list by name for any instance of an AgencyRecord.
Modify the create_data function as follows (I've also changed it to return a DataFrame, rather than registering a temp view):
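A quick way to see what the update calls do, and one caveat to keep in mind, is the sketch below. The Inner and Outer classes are hypothetical stand-ins, not part of the original code. Note that update copies the values at construction time, so later changes to the nested object are not reflected in the flattened attribute, and a nested attribute with the same name as an outer one would silently overwrite it.

```python
class Inner:
    def __init__(self):
        self.city = "Springfield"


class Outer:
    def __init__(self):
        self.name = "outer"
        self.inner = Inner()
        # Copy Inner's attributes onto this instance.
        self.__dict__.update(self.inner.__dict__)


o = Outer()

# The flattened attribute sits next to the original ones.
flat_names = sorted(vars(o))  # ['city', 'inner', 'name']

# Rebinding the nested attribute does not touch the copied one.
o.inner.city = "Shelbyville"
copied_city = o.city  # still "Springfield"
```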
def create_data():
    data = {}
    for i in range(0, 3):
        alc = AgencyRecord()
        data[i] = alc

    column_list = [
        'agency_code', 'agency_id', 'agency_name',
        'address_one', 'address_two', 'person_name', 'phone_number'
    ]

    values = [
        [data[record].__dict__[c] for c in column_list]
        for record in data
    ]

    return spark.createDataFrame(values, column_list)
Now you can do:
temp_df = create_data()
temp_df.show()
#+-----------+---------+-----------+-----------+-----------+-----------+------------+
#|agency_code|agency_id|agency_name|address_one|address_two|person_name|phone_number|
#+-----------+---------+-----------+-----------+-----------+-----------+------------+
#|         00|      000|Some Agency|  address 1| P.O. BOX 1|         Me|999-999-9999|
#|         00|      000|Some Agency|  address 1| P.O. BOX 1|         Me|999-999-9999|
#|         00|      000|Some Agency|  address 1| P.O. BOX 1|         Me|999-999-9999|
#+-----------+---------+-----------+-----------+-----------+-----------+------------+
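Since the approach above is admittedly hacky, here is one possible alternative, offered only as a sketch: leave AgencyRecord as it was in the question and flatten each object on the way out. The flatten helper is hypothetical. It assumes that nested attribute names do not collide and that every attribute is either a plain value or an object with its own __dict__.

```python
class Address:
    def __init__(self):
        self.address_one = "address 1"
        self.address_two = "P.O. BOX 1"


class ContactInfo:
    def __init__(self):
        self.person_name = "Me"
        self.phone_number = "999-999-9999"


class AgencyRecord:
    def __init__(self):
        self.agency_code = "00"
        self.agency_id = "000"
        self.agency_name = "Some Agency"
        self.address = Address()
        self.agency_contact_info = ContactInfo()


def flatten(obj):
    """Hypothetical helper: flat dict of obj's attributes, recursing into nested objects."""
    flat = {}
    for name, value in vars(obj).items():
        if hasattr(value, "__dict__"):
            # Nested object: merge its attributes instead of keeping the object.
            flat.update(flatten(value))
        else:
            flat[name] = value
    return flat


column_list = [
    'agency_code', 'agency_id', 'agency_name',
    'address_one', 'address_two', 'person_name', 'phone_number'
]

values = []
for record in [AgencyRecord() for _ in range(3)]:
    flat = flatten(record)
    # Order the values to match column_list.
    values.append([flat[c] for c in column_list])

# With a Spark session available: spark.createDataFrame(values, column_list)
```

This keeps the class definitions free of the __dict__.update calls, at the cost of one extra function.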