Converting Python Class Object To A DataFrame
How do I convert a Python class object whose fields are instances of other classes to a DataFrame? I tried the code below, but it does not work.
I can get it to work when I remove self.address = Address() and self.agency_contact_info = ContactInfo().
class Address:
    def __init__(self):
        self.address_one = "address 1"
        self.address_two = "P.O. BOX 1"

class ContactInfo:
    def __init__(self):
        self.person_name = "Me"
        self.phone_number = "999-999-9999"

class AgencyRecord:
    def __init__(self):
        self.agency_code = "00"
        self.agency_id = "000"
        self.agency_name = "Some Agency"
        self.address = Address()
        self.agency_contact_info = ContactInfo()

def create_data():
    data = {}
    for i in range(0, 3):
        alc = AgencyRecord()
        data[i] = alc

    column_list = [
        'agency_code', 'agency_id', 'agency_name',
        'address_one', 'address_two', 'person_name', 'phone_number'
    ]

    spark.createDataFrame(
        list(data.values()),
        column_list
    ).createOrReplaceTempView("MyTempTable")
Quoting myself again:
I find it's useful to think of the argument to createDataFrame() as a list of iterables, where each entry in the list corresponds to a row in the DataFrame and each element of the iterable corresponds to a column.
So you need to convert each of your objects into an iterable whose elements correspond, in order, to the columns in column_list.
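As a rough illustration of that "list of iterables" shape, the sketch below builds the rows by hand as plain tuples. The check_rows helper is hypothetical and only there to make the row/column correspondence explicit; the Spark call is left as a comment because it assumes an active SparkSession named spark, as in the question.

```python
# Column names, taken from the question.
column_list = [
    'agency_code', 'agency_id', 'agency_name',
    'address_one', 'address_two', 'person_name', 'phone_number'
]

# One tuple per row; the position of each element matches column_list.
rows = [
    ("00", "000", "Some Agency", "address 1", "P.O. BOX 1", "Me", "999-999-9999"),
    ("00", "000", "Some Agency", "address 1", "P.O. BOX 1", "Me", "999-999-9999"),
]

def check_rows(rows, columns):
    """Hypothetical helper: True if every row has one element per column."""
    return all(len(row) == len(columns) for row in rows)

print(check_rows(rows, column_list))

# Assuming an active SparkSession named `spark`, the same structure
# can be passed straight through:
# df = spark.createDataFrame(rows, column_list)
```

An AgencyRecord instance holding nested Address and ContactInfo objects does not have this flat shape, which is why it needs to be converted first.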
I wouldn't necessarily endorse it (there's almost surely a better way), but here is one hacky approach you can take to modify your code accordingly:
You can take advantage of the fact that instances of ordinary Python classes have a self.__dict__ that you can use to retrieve attributes by name. First, update your AgencyRecord class to pull in the fields from the Address and ContactInfo classes:
class AgencyRecord:
    def __init__(self):
        self.agency_code = "00"
        self.agency_id = "000"
        self.agency_name = "Some Agency"
        self.address = Address()
        self.agency_contact_info = ContactInfo()

        # makes the variables of the contained classes members of this class
        self.__dict__.update(self.address.__dict__)
        self.__dict__.update(self.agency_contact_info.__dict__)
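To see what the __dict__.update() trick does in isolation, here is a minimal sketch that needs no Spark. Wrapper is a hypothetical stand-in for AgencyRecord with a single nested object. It also shows one caveat worth keeping in mind: the copied attributes are a snapshot taken in __init__, so they are not kept in sync with the nested object afterwards.

```python
class Address:
    def __init__(self):
        self.address_one = "address 1"
        self.address_two = "P.O. BOX 1"

class Wrapper:  # hypothetical stand-in for AgencyRecord
    def __init__(self):
        self.agency_code = "00"
        self.address = Address()
        # Copy the nested object's attributes onto this instance.
        self.__dict__.update(self.address.__dict__)

w = Wrapper()
print(vars(w)["address_one"])  # vars(w) returns the same mapping as w.__dict__
print(w.address_one)           # the copied attribute is also reachable directly

# Changing the nested object later does not update the copy.
w.address.address_one = "new address"
print(w.address_one)           # still "address 1"
```

Note also that update() silently overwrites: if the outer class and a nested class share an attribute name, the nested value replaces the outer one.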
Now we can reference each column in column_list by name for any instance of AgencyRecord.
Modify create_data as follows (I've also changed it to return a DataFrame, rather than registering a temp view):
def create_data():
    data = {}
    for i in range(0, 3):
        alc = AgencyRecord()
        data[i] = alc

    column_list = [
        'agency_code', 'agency_id', 'agency_name',
        'address_one', 'address_two', 'person_name', 'phone_number'
    ]

    values = [
        [data[record].__dict__[c] for c in column_list]
        for record in data
    ]

    return spark.createDataFrame(values, column_list)
Now you can do:
temp_df = create_data()
temp_df.show()
#+-----------+---------+-----------+-----------+-----------+-----------+------------+
#|agency_code|agency_id|agency_name|address_one|address_two|person_name|phone_number|
#+-----------+---------+-----------+-----------+-----------+-----------+------------+
#|         00|      000|Some Agency|  address 1| P.O. BOX 1|         Me|999-999-9999|
#|         00|      000|Some Agency|  address 1| P.O. BOX 1|         Me|999-999-9999|
#|         00|      000|Some Agency|  address 1| P.O. BOX 1|         Me|999-999-9999|
#+-----------+---------+-----------+-----------+-----------+-----------+------------+
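Since the approach above is described as hacky, here is one possible alternative, offered only as a sketch: leave the classes exactly as in the question and flatten each record from the outside. The flatten helper is hypothetical (it is not part of Spark or the original answer), and it assumes attribute names are unique across the nested objects, since later names overwrite earlier ones.

```python
class Address:
    def __init__(self):
        self.address_one = "address 1"
        self.address_two = "P.O. BOX 1"

class ContactInfo:
    def __init__(self):
        self.person_name = "Me"
        self.phone_number = "999-999-9999"

class AgencyRecord:  # unchanged from the question
    def __init__(self):
        self.agency_code = "00"
        self.agency_id = "000"
        self.agency_name = "Some Agency"
        self.address = Address()
        self.agency_contact_info = ContactInfo()

def flatten(obj):
    """Hypothetical helper: merge the attributes of obj and of any
    nested objects into one flat dict."""
    flat = {}
    for name, value in vars(obj).items():
        if hasattr(value, "__dict__"):
            # Nested object: lift its fields up and drop the wrapper name.
            flat.update(flatten(value))
        else:
            flat[name] = value
    return flat

column_list = [
    'agency_code', 'agency_id', 'agency_name',
    'address_one', 'address_two', 'person_name', 'phone_number'
]

rows = []
for record in [AgencyRecord() for _ in range(3)]:
    flat = flatten(record)
    rows.append([flat[c] for c in column_list])

# Assuming an active SparkSession named `spark`:
# df = spark.createDataFrame(rows, column_list)
```

The trade-off is that the record classes stay untouched and no copied attributes can drift out of sync, at the cost of a small recursive helper.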