Python Protocol for building a "pandas.DataFrame"

Question

Hello SO and community!

I guess my question somewhat resonates with this one.

However, I believe the task below is a little different from the one referenced above, namely to extract, transform, and load data using pandas.DataFrame, and I am stuck implementing a Protocol for the purpose.

The code is below:

import io
import os
import pandas as pd
import re
import requests
from functools import cache
from typing import Protocol
from zipfile import ZipFile
from pandas import DataFrame


@cache
def extract_can_from_url(url: str, **kwargs) -> DataFrame:
    '''
    Returns DataFrame from downloaded zip file from url
    Parameters
    ----------
    url : str
        url to download from.
    **kwargs : TYPE
        additional arguments to pass to pd.read_csv().
    Returns
    -------
    DataFrame
    '''
    name = url.split('/')[-1]
    if os.path.exists(name):
        with ZipFile(name, 'r').open(name.replace('-eng.zip', '.csv')) as f:
            return pd.read_csv(f, **kwargs)
    else:
        r = requests.get(url)
        with ZipFile(io.BytesIO(r.content)).open(name.replace('-eng.zip', '.csv')) as f:
            return pd.read_csv(f, **kwargs)


class ETL(Protocol):
    # =============================================================================
    # Maybe Using these items for dataclass:
    # url: str
    # meta: kwargs(default_factory=dict)
    # =============================================================================
    def __init__(self, url: str, **kwargs) -> None:
        ...

    def download(self) -> DataFrame:
        ...

    def retrieve_series_ids(self) -> list[str]:
        ...

    def transform(self) -> DataFrame:
        ...

    def sum_up_series_ids(self) -> DataFrame:
        ...


class ETLCanadaFixedAssets(ETL):
    def __init__(self, url: str, **kwargs) -> None:
        self.url = url
        self.kwargs = kwargs

    @cache
    def download(self) -> DataFrame:
        self.df = extract_can_from_url(self.url, **self.kwargs)
        return self.df

    def retrieve_series_ids(self) -> list[str]:
        # =========================================================================
        # Columns Specific to URL below, might be altered
        # =========================================================================
        self._columns = {
            "Prices": 0,
            "Industry": 1,
            "Flows and stocks": 2,
            "VECTOR": 3,
        }
        self.df_cut = self.df.loc[:, tuple(self._columns)]
        _q = (self.df_cut.iloc[:, 0].str.contains('2012 constant prices')) & \
            (self.df_cut.iloc[:, 1].str.contains('manufacturing', flags=re.IGNORECASE)) & \
            (self.df_cut.iloc[:, 2] == 'Linear end-year net stock')
        self.df_cut = self.df_cut[_q]
        self.series_ids = sorted(set(self.df_cut.iloc[:, -1]))
        return self.series_ids

    def transform(self) -> DataFrame:
        # =========================================================================
        # Columns Specific to URL below, might be altered
        # =========================================================================
        self._columns = {
            "VECTOR": 0,
            "VALUE": 1,
        }
        self.df = self.df.loc[:, tuple(self._columns)]
        self.df = self.df[self.df.iloc[:, 0].isin(self.series_ids)]
        return self.df

    def sum_up_series_ids(self) -> DataFrame:
        self.df = pd.concat(
            [
                self.df[self.df.iloc[:, 0] == series_id].iloc[:, [1]]
                for series_id in self.series_ids
            ],
            axis=1
        )
        self.df.columns = self.series_ids
        self.df['sum'] = self.df.sum(axis=1)
        return self.df.iloc[:, [-1]]

UPD

Instantiating the class ETLCanadaFixedAssets

df = ETLCanadaFixedAssets(URL, index_col=0, usecols=range(14)).download().retrieve_series_ids().transform().sum_up_series_ids()

returns an error, which is, however, expected:

AttributeError: 'DataFrame' object has no attribute 'retrieve_series_ids'

Could anyone please provide guidance on how to put these things together (namely, how to retrieve the DataFrame that could otherwise have been obtained procedurally by calling the functions of the last class in the order in which they appear) and point out the mistakes made above?

Probably, there is another way to do this elegantly using injection.

Thank you very much in advance!

Answer 1

All the methods of ETLCanadaFixedAssets and ETL should return self so that you can chain them together. You could add one more method that retrieves the encapsulated DataFrame, but it will always be called last, because once you call it you cannot chain any further methods. What you are trying to build is called a fluent API; you may read more about it here.
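
To make the idea concrete, here is a minimal sketch of the "return self" pattern. It is not the asker's class: `InMemoryETL`, the terminal `result()` method, the `run()` helper, and the toy `raw` frame are all hypothetical names introduced for illustration, and the filtering is reduced to a single column so that the example runs without downloading anything.

```python
from __future__ import annotations

from typing import Protocol

import pandas as pd
from pandas import DataFrame


class ETL(Protocol):
    # Protocol members only declare signatures; each step returns the pipeline.
    def download(self) -> ETL: ...
    def retrieve_series_ids(self) -> ETL: ...
    def transform(self) -> ETL: ...
    def sum_up_series_ids(self) -> ETL: ...
    def result(self) -> DataFrame: ...  # terminal call, ends the chain


class InMemoryETL:
    """Hypothetical stand-in for ETLCanadaFixedAssets that works on toy data."""

    def __init__(self, raw: DataFrame) -> None:
        self._raw = raw
        self.df = DataFrame()
        self.series_ids: list[str] = []

    def download(self) -> InMemoryETL:
        # A real implementation would fetch the data here (assumption).
        self.df = self._raw.copy()
        return self

    def retrieve_series_ids(self) -> InMemoryETL:
        mask = self.df["Flows and stocks"] == "Linear end-year net stock"
        self.series_ids = sorted(set(self.df.loc[mask, "VECTOR"]))
        return self

    def transform(self) -> InMemoryETL:
        keep = self.df["VECTOR"].isin(self.series_ids)
        self.df = self.df.loc[keep, ["VECTOR", "VALUE"]]
        return self

    def sum_up_series_ids(self) -> InMemoryETL:
        # One column per series id, aligned on the index, then a row-wise sum.
        wide = pd.concat(
            [self.df.loc[self.df["VECTOR"] == sid, ["VALUE"]] for sid in self.series_ids],
            axis=1,
        )
        wide.columns = self.series_ids
        wide["sum"] = wide.sum(axis=1)
        self.df = wide
        return self

    def result(self) -> DataFrame:
        return self.df[["sum"]]


def run(pipeline: ETL) -> DataFrame:
    # Any object with these five methods satisfies ETL structurally.
    return (
        pipeline.download()
        .retrieve_series_ids()
        .transform()
        .sum_up_series_ids()
        .result()
    )


# Toy data (made up for this sketch); the index plays the role of the year.
raw = DataFrame(
    {
        "Flows and stocks": ["Linear end-year net stock"] * 4 + ["Investment"] * 2,
        "VECTOR": ["v1", "v2", "v1", "v2", "v3", "v3"],
        "VALUE": [1.0, 2.0, 3.0, 4.0, 100.0, 200.0],
    },
    index=[2000, 2000, 2001, 2001, 2000, 2001],
)

out = run(InMemoryETL(raw))
```

Note that `InMemoryETL` does not inherit from `ETL`. With a Protocol, a static type checker accepts any class with matching method signatures, so `run()` can take whichever concrete pipeline you pass in. That is probably close to what the "injection" remark in the question was getting at.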

Answer 1
0 2022-09-12 22:06:56