DataFrame of DataFrames in Python (Pandas)

For every year, I create three dataframes (df1, df2, df3), each containing different firms and stock prices ('firm' and 'price' are the two columns in df1 to df3). I would like to use another dataframe (named 'store' below) to hold the three dataframes for each year.

Here is my code:

import pandas as pd

store = pd.DataFrame(list(range(1967, 2014)), columns=['year'])
for year in range(1967, 2014):
    # ... code that generates df1, df2 and df3 correctly ...
    store.loc[store['year']==year, 'df1']=df1
    store.loc[store['year']==year, 'df2']=df2
    store.loc[store['year']==year, 'df3']=df3

I get no error or warning when running this code, but in the 'store' dataframe the columns 'df1', 'df2' and 'df3' contain only NaN values.

I think that pandas offers better alternatives to what you're suggesting (rationale below).

For one, there's the pandas.Panel data structure, which was meant for things like what you're doing here (note that Panel was deprecated in pandas 0.20 and removed in 0.25).

However, as Wes McKinney (the pandas author) noted in his book Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, multi-dimensional indices, to a large extent, offer a better alternative.

Consider the following alternative to your code:

import pandas as pd

dfs = []
for year in range(1967, 2014):
    # ... code that generates df1, df2 and df3 ...
    # Tag each frame with its year and its source before collecting it.
    df1['year'] = year
    df1['origin'] = 'df1'
    df2['year'] = year
    df2['origin'] = 'df2'
    df3['year'] = year
    df3['origin'] = 'df3'
    dfs.extend([df1, df2, df3])
df = pd.concat(dfs)

This gives you a DataFrame with 4 columns: 'firm', 'price', 'year', and 'origin'.

This gives you the flexibility to:

  • Organize hierarchically by, say, 'year' and 'origin': df.set_index(['year', 'origin']), or by, say, 'origin' and 'price': df.set_index(['origin', 'price'])

  • Run groupby operations on different levels

  • In general, slice and dice the data in many different ways.
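
As a minimal sketch of the three points above, assume a small made-up table with the same four columns (the firm names and prices are invented for illustration and are not from the question):

```python
import pandas as pd

# Hypothetical data standing in for the concatenated frame; values are invented.
df = pd.DataFrame({
    'firm':   ['A', 'B', 'C', 'A', 'B', 'C'],
    'price':  [10.0, 20.0, 30.0, 12.0, 18.0, 33.0],
    'year':   [1967, 1967, 1967, 1968, 1968, 1968],
    'origin': ['df1', 'df2', 'df3', 'df1', 'df2', 'df3'],
})

# Organize hierarchically by year and origin (sorted for efficient lookups).
indexed = df.set_index(['year', 'origin']).sort_index()

# Slice: all rows for one year, or one value for a (year, origin) pair.
rows_1967 = indexed.loc[1967]
price_1968_df2 = indexed.loc[(1968, 'df2'), 'price']

# Group by a plain column, or by a level of the hierarchical index.
mean_by_origin = df.groupby('origin')['price'].mean()
mean_by_year = indexed.groupby(level='year')['price'].mean()
```

Because 'year' and 'origin' are ordinary columns before set_index is called, the same table can be re-indexed by any other combination of columns without rebuilding it.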

What you're suggesting in the question makes one dimension (origin) arbitrarily different, and it's hard to think of an advantage to this. If a split along some dimension is necessary due to, e.g., performance, you can better combine DataFrames with standard Python data structures:

  • A dictionary mapping each year to a DataFrame with the other three dimensions.

  • Three DataFrames, one for each origin, each having three dimensions.
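
Both options could be sketched roughly as follows. The make_frames helper is hypothetical: it only stands in for the elided per-year code from the question, and the firms and prices it returns are invented.

```python
import pandas as pd

# Hypothetical stand-in for the elided code that builds df1, df2 and df3.
def make_frames(year):
    base = float(year % 100)
    return {
        'df1': pd.DataFrame({'firm': ['A'], 'price': [base]}),
        'df2': pd.DataFrame({'firm': ['B'], 'price': [base + 1]}),
        'df3': pd.DataFrame({'firm': ['C'], 'price': [base + 2]}),
    }

years = range(1967, 1970)  # shortened range, for illustration only

# Option 1: a dictionary mapping each year to a single DataFrame.
# pd.concat on a dict uses the dict keys as the outer index level.
by_year = {
    year: pd.concat(make_frames(year), names=['origin', 'row'])
    for year in years
}

# Option 2: one DataFrame per origin, each spanning all years.
parts = {name: [] for name in ('df1', 'df2', 'df3')}
for year in years:
    for name, frame in make_frames(year).items():
        parts[name].append(frame.assign(year=year))
by_origin = {name: pd.concat(frames, ignore_index=True)
             for name, frames in parts.items()}
```

Here by_year[1967] holds firm, price and origin for that year, while by_origin['df1'] holds firm, price and year for that origin.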
