dict attribute 'type' to select Subclass of dataclass

Question

I have the following class:

from dataclasses import dataclass
from dataclasses_json import dataclass_json

@dataclass_json
@dataclass
class Source:
    type: str = None
    label: str = None
    path: str = None

and the two subclasses:

@dataclass_json
@dataclass
class Csv(Source):
    csv_path: str = None
    delimiter: str = ';'

and

@dataclass_json
@dataclass
class Parquet(Source):
    parquet_path: str = None

Now, given the dictionaries:

parquet = {'type': 'Parquet', 'label': 'events', 'path': '/.../test.parquet', 'parquet_path': '../../result.parquet'}

csv = {'type': 'Csv', 'label': 'events', 'path': '/.../test.csv', 'csv_path': '../../result.csv', 'delimiter': ','}

Now I would like to do something like

Source().from_dict(csv)

and have the result be an instance of the class Csv or Parquet. I understand that if you instantiate the class Source you just "upload" the parameters with the method "from_dict", but is there any possibility of doing this through some kind of inheritance, without a "constructor" that does an if / else if / else over all possible 'types'?

Pureconfig, a Scala library, creates different case classes when the attribute 'type' holds the name of the desired subclass. Is this possible in Python?

Answer 1

You can build a helper that picks and instantiates the appropriate subclass.

def from_data(data: dict, tp: type):
    """Create the subtype of ``tp`` for the given ``data``"""
    subtype = [
        stp for stp in tp.__subclasses__()  # look through all subclasses...
        if stp.__name__ == data['type']     # ...and select by type name
    ][0]
    return subtype(**data)  # instantiate the subtype

This can be called with your data and the base class from which to select:

>>> from_data(
...     {'type': 'Csv', 'label': 'events', 'path': '/.../test.csv', 'csv_path': '../../result.csv', 'delimiter':','},
...     Source,
... )
Csv(type='Csv', label='events', path='/.../test.csv', csv_path='../../result.csv', delimiter=',')
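
One caveat: __subclasses__() returns only direct subclasses, so a class derived from Csv would not be found by the helper above. A minimal sketch of a recursive variant follows. The names iter_subclasses, from_data_recursive and Tsv are hypothetical, and plain dataclasses is used (without dataclass_json) to keep the snippet self-contained.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Source:
    type: Optional[str] = None
    label: Optional[str] = None
    path: Optional[str] = None


@dataclass
class Csv(Source):
    csv_path: Optional[str] = None
    delimiter: str = ';'


@dataclass
class Tsv(Csv):  # hypothetical second-level subclass, for illustration only
    delimiter: str = '\t'


def iter_subclasses(tp: type) -> Iterator[type]:
    """Yield every direct and indirect subclass of ``tp``."""
    for stp in tp.__subclasses__():
        yield stp
        yield from iter_subclasses(stp)


def from_data_recursive(data: dict, tp: type):
    """Like ``from_data``, but also searches nested subclasses."""
    for stp in iter_subclasses(tp):
        if stp.__name__ == data['type']:
            return stp(**data)
    # A named error is easier to debug than an IndexError from ``[...][0]``.
    raise ValueError(f"no subclass of {tp.__name__} named {data['type']!r}")


tsv = from_data_recursive({'type': 'Tsv', 'label': 'events'}, Source)
```

An unknown 'type' raises a ValueError here instead of the IndexError the list-based helper would produce.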

If you need to run this often, it is worth building a dict to optimise the subtype lookup. A simple means is to add a method to your base class, and store the lookup there:

@dataclass_json
@dataclass
class Source:
    type: str = None
    label: str = None
    path: str = None

    @classmethod
    def from_data(cls, data: dict):
        if not hasattr(cls, '_lookup'):
            cls._lookup = {stp.__name__: stp for stp in cls.__subclasses__()}
        return cls._lookup[data["type"]](**data)

This can be called directly on the base class:

>>> Source.from_data({'type': 'Csv', 'label': 'events', 'path': '/.../test.csv', 'csv_path': '../../result.csv', 'delimiter':','})
Csv(type='Csv', label='events', path='/.../test.csv', csv_path='../../result.csv', delimiter=',')
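
Note that the cached _lookup is built on the first call, so subclasses defined after that call are missing from it. One possible alternative, sketched here under the assumption that registering every subclass at definition time is acceptable, is to let __init_subclass__ fill the lookup. The _registry attribute is a name chosen for this sketch, and plain dataclasses is used without dataclass_json.

```python
from dataclasses import dataclass
from typing import ClassVar, Dict, Optional


@dataclass
class Source:
    type: Optional[str] = None
    label: Optional[str] = None
    path: Optional[str] = None

    # ClassVar keeps the registry out of the dataclass fields.
    _registry: ClassVar[Dict[str, type]] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once per subclass definition, at any inheritance depth.
        Source._registry[cls.__name__] = cls

    @classmethod
    def from_data(cls, data: dict):
        # A KeyError here means data['type'] names no registered subclass.
        return cls._registry[data['type']](**data)


@dataclass
class Csv(Source):
    csv_path: Optional[str] = None
    delimiter: str = ';'


@dataclass
class Parquet(Source):
    parquet_path: Optional[str] = None


source = Source.from_data({'type': 'Parquet', 'label': 'events'})
```

Because registration happens when each class is defined, the lookup never goes stale and also covers subclasses of subclasses.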

Answer 2

This is a variation on my answer to this question.

@dataclass_json
@dataclass
class Source:
    type: str = None
    label: str = None
    path: str = None

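    def __new__(cls, type=None, **kwargs):
        # Pick the direct subclass whose name matches ``type``...
        for subclass in cls.__subclasses__():
            if subclass.__name__ == type:
                break
        else:
            # ...or fall back to the class being instantiated.
            subclass = cls
        # __init__ then runs on the returned instance with the same arguments.
        instance = super(Source, subclass).__new__(subclass)
        return instance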

assert type(Source(**csv)) == Csv
assert type(Source(**parquet)) == Parquet
assert Csv(**csv) == Source(**csv)
assert Parquet(**parquet) == Source(**parquet)

You asked and I am happy to oblige. However, I'm questioning whether this is really what you need. I think it might be overkill for your situation. I originally figured this trick out so I could instantiate directly from data when...

my data was heterogeneous and I didn't know ahead of time which subclass was appropriate for each datum,
I didn't have control over the data, and
figuring out which subclass to use required some processing of the data, processing which I felt belonged inside the class (for logical reasons as well as to avoid polluting the scope in which the instantiating took place).

If those conditions apply to your situation, then I think this is a worthwhile approach. If not, the added complexity of mucking with __new__ -- a moderately advanced maneuver -- might not outweigh the savings in complexity in the code used to instantiate. There are probably simpler alternatives.

For example, it appears as though you already know which subclass you need; it's one of the fields in the data. If you put it there, presumably whatever logic you wrote to do so could be used to instantiate the appropriate subclass right then and there, bypassing the need for my solution. Alternatively, instead of storing the name of the subclass as a string, store the subclass itself. Then you could do this: data['type'](**data)
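
As a rough sketch of that last idea, the snippet below keeps the class object itself under 'type'. It assumes the dictionary is built in Python; a class object cannot be written to JSON as-is. Plain dataclasses is used, and the annotation of the type field is changed accordingly for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Type


@dataclass
class Source:
    type: Optional[Type['Source']] = None  # the class itself, not its name
    label: Optional[str] = None
    path: Optional[str] = None


@dataclass
class Csv(Source):
    csv_path: Optional[str] = None
    delimiter: str = ';'


data = {'type': Csv, 'label': 'events', 'csv_path': '../../result.csv'}

# No lookup needed: the value under 'type' is already callable.
source = data['type'](**data)
```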

It also occurs to me that maybe you don't need inheritance at all. Do Csv and Parquet store the same type of data, differing only in which file format they read it from? Then maybe you just need one class with from_csv and from_parquet methods. Alternatively, if one of the parameters is a filename, it would be easy to figure out which type of file parsing you need based on the filename extension. Normally I'd put this in __init__, but since you're using dataclass, I guess this would happen in __post_init__.
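
The filename-extension idea could look roughly like this. It is only a sketch: the single Source class, its file_format field and the _FORMATS mapping are assumptions made for illustration, not part of the question's code.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping from file extension to a format name.
_FORMATS = {'.csv': 'csv', '.parquet': 'parquet'}


@dataclass
class Source:
    label: Optional[str] = None
    path: Optional[str] = None
    delimiter: str = ';'
    # Derived in __post_init__, so it is not an __init__ parameter.
    file_format: Optional[str] = field(init=False, default=None)

    def __post_init__(self):
        if self.path is not None:
            suffix = PurePosixPath(self.path).suffix.lower()
            if suffix not in _FORMATS:
                raise ValueError(f'unsupported file extension: {suffix!r}')
            self.file_format = _FORMATS[suffix]


csv_source = Source(label='events', path='/.../test.csv', delimiter=',')
parquet_source = Source(label='events', path='/.../test.parquet')
```

With this layout the 'type' key is no longer needed in the data, since the format follows from the path.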

Answer 3

Do you need this behavior?

from dataclasses import dataclass
from typing import Optional, Union, List

from validated_dc import ValidatedDC


@dataclass
class Source(ValidatedDC):
    label: Optional[str] = None
    path: Optional[str] = None


@dataclass
class Csv(Source):
    csv_path: Optional[str] = None
    delimiter: str = ';'


@dataclass
class Parquet(Source):
    parquet_path: Optional[str] = None


@dataclass
class InputData(ValidatedDC):
    data: List[Union[Parquet, Csv]]


# Let's say you got a json-string and loaded it:
data = [
    {
        'label': 'events', 'path': '/.../test.parquet',
        'parquet_path': '../../result.parquet'
    },
    {
        'label': 'events', 'path': '/.../test.csv',
        'csv_path': '../../result.csv', 'delimiter': ','
    }
]


input_data = InputData(data=data)

for item in input_data.data:
    print(item)

# Parquet(label='events', path='/.../test.parquet', parquet_path='../../result.parquet')
# Csv(label='events', path='/.../test.csv', csv_path='../../result.csv', delimiter=',')

validated_dc: https://github.com/EvgeniyBurdin/validated_dc
