Pythonic way to declare multiple empty dataframes in a class?

I have a class like the one below. What is the most Pythonic way to declare and initialize multiple empty DataFrames?

import pandas as pd

class ReadData:

  def __init__(self, input_dir):
      self.df1 = pd.DataFrame(data=None)
      self.df2 = pd.DataFrame(data=None)
      self.df3 = pd.DataFrame(data=None)
      self.input_dir = input_dir
    
  
  def read_inputs(self):
      # self is required so the method can reach the instance attributes
      self.df1 = pd.read_csv(self.input_dir+"/file1.csv")
      self.df2 = pd.read_csv(self.input_dir+"/file2.csv")
      self.df3 = pd.read_csv(self.input_dir+"/file3.csv")

ReadData("./").read_inputs()

In general, DataFrames are not meant to be created empty and then appended to: appending to a DataFrame is slow and memory-intensive because each append creates a new copy of the data. You'll be better off collecting your data in a structure that supports fast appends, such as a list, and building the DataFrame once at the end.
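As a minimal sketch of that advice (the column names and values here are made up for illustration), you could collect rows in a plain list and construct the DataFrame in one step:

```python
import pandas as pd

# Hypothetical records; in practice these would come from your own source.
rows = []
for i in range(3):
    # Appending to a Python list is cheap compared with growing a DataFrame.
    rows.append({"id": i, "value": i * 10})

# Build the DataFrame once from the accumulated list of dicts.
df = pd.DataFrame(rows)
```

The frame is only allocated once here, instead of being copied on every append.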

However, to answer your question, you can use a dictionary comprehension and keep your dataframes in a dictionary. Or you can do the same with a list.

import pandas as pd

class Data:
    def __init__(self):
        self.dfs = {
            # keys are "df1", "df2", "df3", matching the names in the question
            "df{}".format(i): pd.DataFrame(data=None)
            for i in range(1, 4)
        }

Then you can access your data like so:

data = Data()
data.dfs["df1"]

The real advantage of a dictionary, though, is that you can give each frame an explicit name, so a structure like this may be more intuitive:

class Data:
    def __init__(self, df_names):
        self.dfs = {
            name: pd.DataFrame(data=None) for name in df_names
        }

data = Data(df_names=["df1", "better_named_df", "averages"])

# accessing underlying frames
data.dfs["df1"]
data.dfs["better_named_df"]

Another approach uses a list comprehension instead of a dictionary:

import pandas as pd

class Data:
    def __init__(self):
        self.dfs = [pd.DataFrame(data=None) for _ in range(3)]

data = Data()
data.dfs[0]
data.dfs[1]

Since you specified that you're just reading in these DataFrames to run different queries against them, I wouldn't recommend a class at all, because there is no common functionality that you're going to run against each DataFrame aside from reading it into memory. A function that returns a dictionary should suffice:

import pathlib
import pandas as pd

def read_data(base_dir, file_names):
    dataframes = {}
    
    base_dir = pathlib.Path(base_dir)
    for fname in file_names:
        fpath = base_dir / fname

        dataframes[fpath.stem] = pd.read_csv(fpath)
    return dataframes

# you can call this function like so:
dfs = read_data("./", ["file1.csv", "file2.csv", "file3.csv"])

# dfs is a dictionary with this structure:
# {"file1": dataframe from file1.csv,
#  "file2": dataframe from file2.csv,
#  "file3": dataframe from file3.csv}

# access data like this
dfs["file1"]

If you are intent on having each DataFrame be an attribute, you can take advantage of setattr.

import pandas as pd

class Data:
    def __init__(self, n):
        for num in range(1, n + 1):
            setattr(self, f"df{num}", pd.DataFrame())

Whatever number you pass to the constructor, the object ends up with that many DataFrame attributes (df1, df2, and so on).
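A short usage sketch may help here. Assuming the same Data class as above, the dynamically created attributes can be read back by name with getattr, and vars lists what was created (the variable names frames and names are just for illustration):

```python
import pandas as pd

class Data:
    def __init__(self, n):
        # Create attributes df1 .. dfn, each holding an empty DataFrame.
        for num in range(1, n + 1):
            setattr(self, f"df{num}", pd.DataFrame())

data = Data(3)

# Attributes created with setattr can be looked up by name with getattr.
frames = [getattr(data, f"df{num}") for num in range(1, 4)]

# vars() exposes the instance's attribute dictionary.
names = sorted(vars(data))
```

Note that looking attributes up by a constructed name is essentially a dictionary lookup, which is one reason the dictionary-based versions above are often easier to work with.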
