Python pandas: construct a list of dataclass objects from the rows of a dataframe

A consistent answer seems to be to avoid iterating over rows while working with Pandas. I'd like to understand how I can do so in the following case.

from dataclasses import dataclass
from typing import List

import pandas as pd

@dataclass
class Person:
    id: int
    name: str
    age: int

persons_df = pd.DataFrame(data={'id': [1, 2, 3], 'name': ['A', 'B', 'C'], 'age': [32, 44, '86']})

persons_list: List[Person] = []  # populate this list with Person objects created from the dataframe above

# my approach is to loop over the rows with itertuples()
for row in persons_df.itertuples():
    person = Person(row.id, row.name, int(row.age))  # type: ignore
    persons_list.append(person)

I'd like to find an option that avoids the explicit row loop and, if possible, has some type safety built in (so that the mypy ignore comment is not needed).

thanks!

I am not sure if that's what you are looking for, but maybe this helps:

import pandas as pd
df = pd.DataFrame(data={'id': [1, 2, 3], 'name': ['A', 'B', 'C'], 'age': [32, 44, '86']})

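class Person:
    def __init__(self, row):
        # row is the pandas Series that apply() passes in for each dataframe row;
        # .iloc makes the positional access explicit
        self.id = row.iloc[0]
        self.name = row.iloc[1]
        self.age = row.iloc[2]

# axis=1 calls Person once per row; tolist() turns the resulting Series into a list
df.apply(Person, axis=1).tolist()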

out:

[<__main__.Person at 0x176eee70608>,
 <__main__.Person at 0x176eee704c8>,
 <__main__.Person at 0x176eee70388>]

I am adding a new answer because the title of the question asks how to map dataframe rows to a list of dataclass objects, and this has not been addressed yet.

To return dataclasses, we can slightly improve @Andreas's answer so that it no longer needs an additional constructor that receives a list. We just have to use Python's unpacking operators (* and **).

I see two ways of mapping:

  1. The dataframe column names match the dataclass field names. In this case, we can pass each row as a set of keyword arguments: df.apply(lambda row: MyDataClass(**row), axis=1)
  2. The dataframe column names do not match the dataclass field names, but the column order matches the dataclass field order. In this case, we can pass each row's values as ordered positional arguments: df.apply(lambda row: MyDataClass(*row), axis=1)

Example:

  1. Define the same dataclass and the same dataframe as in the question:
     from dataclasses import dataclass

     @dataclass
     class Person:
         id: int
         name: str
         age: int

     import pandas

     df = pandas.DataFrame(data={
         'id': [1, 2, 3],
         'name': ['A', 'B', 'C'],
         'age': [32, 44, '86']
     })
  2. Conversion based on column order:
     persons = df.apply(lambda row: Person(*row), axis=1)
  3. Conversion based on column names (column order is shuffled for a better test):
     persons = df[['age', 'id', 'name']].apply(lambda row: Person(**row), axis=1)
  4. Now, we can verify our result. In both cases above:
    • This snippet:
       print(type(persons))
       print(persons)
    • prints:
       <class 'pandas.core.series.Series'>
       0      Person(id=1, name='A', age=32)
       1      Person(id=2, name='B', age=44)
       2    Person(id=3, name='C', age='86')
       dtype: object
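
Note that the result above is a pandas Series, while the question asks for a list. A minimal sketch of the extra step, reusing the same Person and df definitions and the name-based mapping (the variable name persons_list is taken from the question):

```python
from dataclasses import dataclass
from typing import List

import pandas as pd


@dataclass
class Person:
    id: int
    name: str
    age: int


df = pd.DataFrame(data={'id': [1, 2, 3], 'name': ['A', 'B', 'C'], 'age': [32, 44, '86']})

# apply() returns a Series of Person objects; tolist() turns it into a plain Python list
persons_list: List[Person] = df.apply(lambda row: Person(**row), axis=1).tolist()

print(type(persons_list))
print(persons_list[0])
```

The values are still passed through unchanged, so the last person's age remains the string '86'.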

WARNINGS:

  • I have not measured the performance of this solution.
  • This does not enforce any type checking (look at the last person printed: its age is a string). As Python does not enforce type hints at runtime, this quick solution does not bring any additional safety.
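
One possible way to address the second warning is to coerce the columns to the expected types before building the objects. This is only a sketch under some assumptions: the column-to-type mapping is written out by hand, typed_df is a name introduced here for illustration, and I have not benchmarked it or run mypy against it. The explicit int() and str() calls are there so that the argument types are visible in the code rather than inferred from the dataframe.

```python
from dataclasses import dataclass
from typing import List

import pandas as pd


@dataclass
class Person:
    id: int
    name: str
    age: int


df = pd.DataFrame(data={'id': [1, 2, 3], 'name': ['A', 'B', 'C'], 'age': [32, 44, '86']})

# Coerce each column to the type the dataclass expects.
# astype raises ValueError if a value (for example 'abc') cannot be converted.
typed_df = df.astype({'id': int, 'name': str, 'age': int})

# to_dict('records') yields one plain dict per row, keyed by column name,
# so no apply() or iterrows() call is needed.
persons_list: List[Person] = [
    Person(id=int(rec['id']), name=str(rec['name']), age=int(rec['age']))
    for rec in typed_df.to_dict('records')
]

print(persons_list[-1])
```

With this approach the string '86' in the age column is converted to an integer, and a value that cannot be converted fails early in astype instead of ending up inside a Person object.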
