List of Python Objects from Pandas Dataframe

What I want to do is take a data frame and turn each row into a Python object, namely an instance of the RawData class shown below. The dataframe contains 10^5 to 10^6 rows.

import uuid

from django.db import models


# Each row represents one RawData object
class RawData(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    label_name = models.CharField(max_length=512, null=True)
    currency = models.TextField(blank=True)
    content_id = models.TextField(blank=True)

    @classmethod
    def create_by_itertuples(cls, item):
        # item is a namedtuple; convert it to a dict
        row = item._asdict()
        return cls(label_name=row['Labels'], currency=row['Currency'], content_id=row['Content_Id'])

    @classmethod
    def create_by_iterrows(cls, row):
        # row is a pandas Series indexed by column name
        return cls(label_name=row['Labels'], currency=row['Currency'], content_id=row['Content_Id'])

    @classmethod
    def create_by_vectorization(cls, Labels, Currency, Content_Id):
        # How to proceed?
        ...

I have tried iterrows and itertuples.


import pandas as pd

# sample dataframe
# initialize list of lists
data = [['T-Series', 'BDT', 'UX25437'],
        ['Dragons Den', 'EUR', 'UF5432'],
        ['A-Train', 'USD', 'GH5342']]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Labels', 'Currency', 'Content_Id'])

# Collect one RawData object per row
listofrows = []
for index, row in df.iterrows():
    print(index)
    r = RawData.create_by_iterrows(row)
    listofrows.append(r)

I did the same thing with itertuples, which gave much better performance.

for item in df.itertuples():
    listofrows.append(RawData.create_by_itertuples(item))


With the number of rows in mind, I am now trying NumPy vectorization, but I am having a hard time returning a list of objects from ndarrays.

listofrows = RawData.create(df['property1'].to_numpy(),
                            df['property2'].to_numpy(),
                            df['property3'].to_numpy()).to_list()

If I have to iterate over the arrays inside create(), I figure there is no advantage to vectorizing. Can this be improved by vectorization? Any help is appreciated.
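To make the concern concrete, here is a minimal sketch of a create() that receives whole columns as arrays. It assumes a plain dataclass, RawRecord, as a hypothetical stand-in for the Django model (so it runs without a configured Django project), and create_by_columns is likewise an illustrative name. The column names come from the sample dataframe.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class RawRecord:
    # Hypothetical stand-in for RawData; not the Django model itself.
    label_name: str
    currency: str
    content_id: str

    @classmethod
    def create_by_columns(cls, labels, currencies, content_ids):
        # zip walks the three arrays in lockstep; one object is built per row.
        return [cls(l, c, i) for l, c, i in zip(labels, currencies, content_ids)]


df = pd.DataFrame(
    [["T-Series", "BDT", "UX25437"],
     ["Dragons Den", "EUR", "UF5432"],
     ["A-Train", "USD", "GH5342"]],
    columns=["Labels", "Currency", "Content_Id"],
)

listofrows = RawRecord.create_by_columns(
    df["Labels"].to_numpy(),
    df["Currency"].to_numpy(),
    df["Content_Id"].to_numpy(),
)
print(listofrows[0])
```

The constructor is still called once per row in a Python loop, so passing arrays only avoids building a Series or namedtuple for each row; nothing here is vectorized in the NumPy sense.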

Note: I am following this article.

Edit: As vectorization is allowed only for primitives, are there any better ways to do such an operation?

df["obj"] = df.apply(lambda row: (row["property1"], row["property2"], row["property3"]), axis=1)

Or, if there is complex logic involved that can't fit on one line, create a function and pass it in place of the lambda.
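A sketch of that second form, assuming the property1 to property3 column names used above. build_obj is a hypothetical name, the sample values are made up so the snippet runs, and the strip/upper steps are only placeholders for whatever the real logic is.

```python
import pandas as pd


def build_obj(row):
    # Hypothetical helper: any multi-step logic can live here.
    label = str(row["property1"]).strip()
    currency = str(row["property2"]).upper()
    return (label, currency, row["property3"])


# Made-up sample data, only to make the sketch runnable.
df = pd.DataFrame({
    "property1": [" T-Series ", "A-Train"],
    "property2": ["bdt", "usd"],
    "property3": ["UX25437", "GH5342"],
})

# Pass the function itself; with axis=1, apply calls it once per row.
df["obj"] = df.apply(build_obj, axis=1)
listofobjs = df["obj"].tolist()
print(listofobjs)
```

Note that apply with axis=1 still runs the function row by row in Python, so this is a convenience rather than a vectorized operation.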
