Generate a pandas dataframe with for-loop

I have generated a dataframe (called 'sectors') that stores information from my brokerage account (sector/industry, sub sector, company name, current value, cost basis, etc.).

I want to avoid hard-coding a filter for each sector or sub sector to find specific data. I have achieved this with the following code (I know, not very pythonic, but I am new to coding):

for x in set(sectors_df['Sector']):
    x_filt = sectors_df['Sector'] == x
    #value in sect takes the sum of all current values in a given sector
    value_in_sect = round(sectors_df.loc[x_filt]['Current Value'].sum(), 2)
    #pct in sect is the % of the sector in the over all portfolio (total equals the total value of all sectors) 
    pct_in_sect = round((value_in_sect/total)*100 , 2)
    print(x, value_in_sect, pct_in_sect)

for sub in set(sectors_df['Sub Sector']):
    sub_filt = sectors_df['Sub Sector'] == sub
    value_of_subs = round(sectors_df.loc[sub_filt]['Current Value'].sum(), 2)
    pct_of_subs = round((value_of_subs/total)*100, 2)
    print(sub, value_of_subs, pct_of_subs)

My print statements produce the majority of the information I want, although I am still working through how to program for the % of a sub sector within its own sector. Anyway, I would now like to put this information (value_in_sect, pct_in_sect, etc.) into dataframes of their own. What would be the best, smartest, or most pythonic way to go about this? I am thinking of a dictionary, and then creating a dataframe from the dictionary, but I am not sure.

The split-apply-combine process in pandas, specifically aggregation, is the best way to go about this. First I'll explain how this process would work manually, and then I'll show how pandas can do it in one line.

Manual split-apply-combine

Split

First, divide the DataFrame into groups that share the same Sector. This involves getting a list of Sectors and figuring out which rows belong to each one (just like the first two lines of your code). This code runs through the DataFrame and builds a dictionary whose keys are Sectors and whose values are lists of the index labels of the rows in sectors_df that belong to that Sector.

sectors_index = {}
for ix, row in sectors_df.iterrows():
    if row['Sector'] not in sectors_index:
        sectors_index[row['Sector']] = [ix]
    else:
        sectors_index[row['Sector']].append(ix)

Apply

Run the same function on each group, in this case summing Current Value and calculating its percentage share. That is, for each sector, grab the corresponding rows from the DataFrame and run the calculations from the next lines of your code. I'll store the results as a dictionary of dictionaries, {'Sector1': {'value_in_sect': 1234.56, 'pct_in_sect': 11.11}, 'Sector2': ... }, for reasons that will become obvious later:

sector_total_value = {}
total_value = sectors_df['Current Value'].sum()
for sector, row_indices in sectors_index.items():
    sector_df = sectors_df.loc[row_indices]
    current_value = sector_df['Current Value'].sum()
    sector_total_value[sector] = {'value_in_sect': round(current_value, 2),
                                  'pct_in_sect': round(current_value/total_value * 100, 2)
                                 }

(see footnote 1 for a note on rounding)

Combine

Finally, collect the function results into a new DataFrame, where the index is the Sector. pandas can easily convert this nested dictionary structure into a DataFrame:

sector_total_value_df = pd.DataFrame.from_dict(sector_total_value, orient='index')
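To make the conversion concrete, here is a minimal sketch using a small hand-written dictionary in the same nested shape. The sector names and numbers are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical results in the same nested shape as sector_total_value
sector_total_value = {
    'Energy': {'value_in_sect': 400.0, 'pct_in_sect': 40.0},
    'Utilities': {'value_in_sect': 600.0, 'pct_in_sect': 60.0},
}

# With orient='index', the outer keys become the index (one row per Sector)
# and the inner keys become the columns
sector_total_value_df = pd.DataFrame.from_dict(sector_total_value, orient='index')
print(sector_total_value_df)
```

This is why the results were stored as a dictionary of dictionaries: each outer key maps directly to a row.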

Split-apply-combine using groupby

pandas makes this process very simple using the groupby method.

Split

The groupby method splits a DataFrame into groups by a column or multiple columns (or even another Series):

grouped_by_sector = sectors_df.groupby('Sector')

grouped_by_sector is similar to the index we built earlier, but the groups can be manipulated much more easily, as we can see in the following steps.
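To see the similarity, the groupby object's groups attribute can be inspected. The sketch below assumes a tiny made-up portfolio (the rows and values are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data, not from the question
sectors_df = pd.DataFrame({
    'Sector': ['Energy', 'Energy', 'Utilities', 'Utilities'],
    'Sub Sector': ['Gas', 'Oil', 'Gas', 'Electric'],
    'Current Value': [100.0, 300.0, 200.0, 400.0],
})

grouped_by_sector = sectors_df.groupby('Sector')

# .groups maps each group key to the index labels of its rows,
# much like the hand-built sectors_index dictionary
for sector, row_labels in grouped_by_sector.groups.items():
    print(sector, list(row_labels))

# get_group returns the rows of a single group as a DataFrame
energy_df = grouped_by_sector.get_group('Energy')
print(energy_df)
```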

Apply

To calculate the total value in each group, select the column or columns to sum, then use the agg (or aggregate) method with the function you want to apply:

sector_total_value_df = grouped_by_sector['Current Value'].agg(value_in_sect='sum')

Combine

It's already done! The apply step already creates a DataFrame where the index is the Sector (the groupby column) and the value in the value_in_sect column is the result of the sum operation.

I've left out the pct_in_sect part because a) it can be done more easily after the fact:

sector_total_value_df['pct_in_sect'] = round(sector_total_value_df['value_in_sect'] / total_value * 100, 2)
sector_total_value_df['value_in_sect'] = round(sector_total_value_df['value_in_sect'], 2)

and b) it's outside the scope of this answer.
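Putting the groupby steps together, a minimal runnable sketch might look like this. The sample data is invented for illustration, and the aggregation is passed as the string 'sum', which is one way to name it.

```python
import pandas as pd

# Hypothetical sample data, not from the question
sectors_df = pd.DataFrame({
    'Sector': ['Energy', 'Energy', 'Utilities', 'Utilities'],
    'Sub Sector': ['Gas', 'Oil', 'Gas', 'Electric'],
    'Current Value': [100.0, 300.0, 200.0, 400.0],
})
total_value = sectors_df['Current Value'].sum()

# Split and apply: one row per Sector, with the summed Current Value
sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(value_in_sect='sum')

# Percentage is computed from the unrounded sums; rounding happens last
sector_total_value_df['pct_in_sect'] = (
    sector_total_value_df['value_in_sect'] / total_value * 100
).round(2)
sector_total_value_df['value_in_sect'] = sector_total_value_df['value_in_sect'].round(2)

print(sector_total_value_df)
```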

Most of this can be done easily in one line (see footnote 2 for including the percentage, and rounding):

sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(value_in_sect='sum')

For subsectors, there's one additional consideration: grouping should be done by Sector and Sub Sector rather than just Sub Sector, so that, for example, rows from Utilities/Gas and Energy/Gas aren't combined.

subsector_total_value_df = sectors_df.groupby(['Sector', 'Sub Sector'])['Current Value'].agg(value_in_sect='sum')

This produces a DataFrame with a MultiIndex with levels 'Sector' and 'Sub Sector', and a column 'value_in_sect'. For a final piece of magic, the percentage of each Sub Sector within its Sector can be calculated quite easily:

subsector_total_value_df['pct_within_sect'] = round(subsector_total_value_df['value_in_sect'] / sector_total_value_df['value_in_sect'] * 100, 2)

which works because the 'Sector' index level is matched during division.
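If you prefer to make the level matching explicit, Series.div accepts a level argument. The following is a sketch on invented sample data and a variation on the line above rather than a replacement for it:

```python
import pandas as pd

# Hypothetical sample data, not from the question
sectors_df = pd.DataFrame({
    'Sector': ['Energy', 'Energy', 'Utilities', 'Utilities'],
    'Sub Sector': ['Gas', 'Oil', 'Gas', 'Electric'],
    'Current Value': [100.0, 300.0, 200.0, 400.0],
})

sector_totals = sectors_df.groupby('Sector')['Current Value'].sum()
subsector_totals = sectors_df.groupby(['Sector', 'Sub Sector'])['Current Value'].sum()

# Divide each (Sector, Sub Sector) total by the total of its own Sector,
# broadcasting the Sector totals across the 'Sector' level of the MultiIndex
pct_within_sect = subsector_totals.div(sector_totals, level='Sector').mul(100).round(2)
print(pct_within_sect)
```

Note that Energy/Gas and Utilities/Gas stay separate because the grouping uses both columns.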

Footnote 1. This deviates from your code slightly, because I've chosen to calculate the percentage using the unrounded total value, to minimize the error in the percentage. Ideally though, rounding is only done at display time.
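One possible way to round only at display time is to keep the stored values unrounded and call round when printing. This is a sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical unrounded results, invented for illustration
sector_total_value_df = pd.DataFrame(
    {'value_in_sect': [1234.5678, 8765.4322],
     'pct_in_sect': [12.345678, 87.654322]},
    index=['Energy', 'Utilities'],
)

# round() returns a rounded copy for display;
# the original DataFrame keeps its full precision
rounded_for_display = sector_total_value_df.round(2)
print(rounded_for_display)
```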

Footnote 2. This one-liner generates the desired result, including percentage and rounding:

sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(
    value_in_sect = lambda c: round(sum(c), 2),
    pct_in_sect = lambda c: round(sum(c)/sectors_df['Current Value'].sum() * 100, 2),
)
