List of Python Objects from Pandas Dataframe
What I want to do is take a DataFrame and turn each row into a Python object, namely an instance of the RawData class shown below. The DataFrame contains 10^5 to 10^6 rows.
import uuid
from django.db import models

# Each row represents one RawData object
class RawData(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    label_name = models.CharField(max_length=512, null=True)
    currency = models.TextField(blank=True)
    content_id = models.TextField(blank=True)

    @classmethod
    def create_by_itertuples(cls, item):
        # item is a namedtuple; convert it to a dict
        row = item._asdict()
        return cls(label_name=row['Labels'], currency=row['Currency'], content_id=row['Content_Id'])

    @classmethod
    def create_by_iterrows(cls, row):
        # row is a pandas Series indexed by column name
        return cls(label_name=row['Labels'], currency=row['Currency'], content_id=row['Content_Id'])

    @classmethod
    def create_by_vectorization(cls, Labels, Currency, Content_Id):
        # How to proceed?
        ...
I have tried iterrows and itertuples.
import pandas as pd

# Sample DataFrame, built from a list of lists
data = [['T-Series', 'BDT', 'UX25437'],
        ['Dragons Den', 'EUR', 'UF5432'],
        ['A-Train', 'USD', 'GH5342']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Labels', 'Currency', 'Content_Id'])

listofrows = []
for index, row in df.iterrows():
    print(index)
    r = RawData.create_by_iterrows(row)
    listofrows.append(r)
I did the same thing with itertuples, which provided much better performance.
listofrows = []
for item in df.itertuples():
    listofrows.append(RawData.create_by_itertuples(item))
With the number of rows in mind, I am now trying NumPy vectorization, but I am having a hard time returning a list of objects from ndarrays.
listofrows = RawData.create_by_vectorization(
    df[''].to_numpy(),
    df['property2'].to_numpy(),
    df['property3'].to_numpy()).to_list()
If I have to iterate over the arrays inside create(), I figure vectorization offers no advantage. Can this be improved by vectorization? Any help is appreciated.
Note: I am following this article.
Edit: Since vectorization only works on primitive types, is there a better way to do this kind of operation?
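Building one Python object per row is inherently a Python-level loop, so NumPy vectorization does not help with that step. One option that is often faster than iterrows or itertuples is to pull each column out once and zip them together. The sketch below is only an illustration: RawDataRow is a hypothetical plain dataclass standing in for the RawData model so the example runs without Django, and it is worth timing on your own data.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class RawDataRow:
    # Hypothetical stand-in for RawData (not the Django model from the question)
    label_name: str
    currency: str
    content_id: str


df = pd.DataFrame(
    [["T-Series", "BDT", "UX25437"],
     ["Dragons Den", "EUR", "UF5432"],
     ["A-Train", "USD", "GH5342"]],
    columns=["Labels", "Currency", "Content_Id"],
)

# Convert each column to a plain list once, then build the objects
# in a single list comprehension over the zipped columns.
listofrows = [
    RawDataRow(label, currency, content_id)
    for label, currency, content_id in zip(
        df["Labels"].tolist(),
        df["Currency"].tolist(),
        df["Content_Id"].tolist(),
    )
]
```

The loop is still there, but it runs over plain Python values instead of constructing a Series or namedtuple for every row.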
df["obj"] = df.apply(lambda row: (row["property1"], row["property2"], row["property3"]), axis=1)
Or, if the logic is too complex to fit on one line, write a function and pass it in place of the lambda.
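As a rough sketch of the named-function variant: build_obj and RawDataRow below are hypothetical names introduced for illustration, and the column names are taken from the sample DataFrame in the question rather than property1..property3.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class RawDataRow:
    # Hypothetical stand-in for the RawData class in the question
    label_name: str
    currency: str
    content_id: str


def build_obj(row):
    # Hypothetical helper: multi-step logic goes here instead of a lambda.
    label = row["Labels"].strip()
    currency = row["Currency"].upper()
    return RawDataRow(label, currency, row["Content_Id"])


df = pd.DataFrame(
    [["T-Series", "BDT", "UX25437"],
     ["Dragons Den", "EUR", "UF5432"],
     ["A-Train", "USD", "GH5342"]],
    columns=["Labels", "Currency", "Content_Id"],
)

# axis=1 passes each row to build_obj as a Series
df["obj"] = df.apply(build_obj, axis=1)
listofrows = df["obj"].tolist()
```

Note that apply with axis=1 still calls the function once per row, so this is a matter of readability rather than speed.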