
dict comprehension for nested lists to filter values of multiple variables

I have a working example of a dict comprehension over a list I iterate through. It generates various indicators (selections) that separate the rows of my data into cases (which are not mutually exclusive, by the way).

For context: this is done to count cases for specific rows (the criterion is defined by a column) when I aggregate the table into groups. The indicators are currently collected in separate dataframes and exported separately, though I would also be happy to keep everything in one dataframe for a single aggregation, concatenation and export, if possible.

Now I want to nest this into another loop that defines which other variable I select or filter the values for. Item 0 would still be the condition itself (the sum of the indicator being the count of cases), item 1 the selected cases of TKOST (to get a selective sum for separate criteria later), and item 2 another variable I would now read in.

It would make sense for this loop to affect the variable names too, e.g. to have a plain neuro variable for the count (or neuro_count), a neuro_cost for the sum of TKOST over the neuro cases, and so on. How is this possible?

The sample code basically comes from Alexander's answer to another question. The file I/O and pandas parts are included for context.

import pandas as pd

items = {'neuro': 'N', 
         'cardio': 'C', 
         'cancer': 'L', 
         'anesthetics': 'N01', 
         'analgesics': 'N02', 
         'antiepileptics': 'N03', 
         'anti-parkinson drugs': 'N04', 
         'psycholeptics': 'N05', 
         'psychoanaleptics': 'N06', 
         'addiction_and_other_neuro': 'N07', 
         'Adrugs': 'A', 
         'Mdrugs': 'M', 
         'Vdrugs': 'V', 
         'all_drugs': ''}

# Create data containers using dictionary comprehension.
dfs = {item: pd.DataFrame() for item in items.keys()}
monthly_summaries = {item: list() for item in items.keys()}

# Perform monthly groupby operations.
for year in xrange(2005, 2013):
    for month in xrange(1, 13):
        if year == 2005 and month < 7:
            continue
        filename = 'PATH/STUB_' + str(year) + '_mon'+ str(month) +'.txt'
        monthly = pd.read_table(filename,usecols=[0,3,32])
        monthly['year'] = year
        monthly['month'] = month
        dfs = {name: monthly[(monthly.ATC.str.startswith('{0}'.format(code))) 
                             & (~(monthly.TKOST.isnull()))]
                     for name, code in items.iteritems()}
        [monthly_summaries[name].append(dfs[name].groupby(['LopNr','year','month']).sum()
                                        .astype(int, copy=False)) 
         for name in items.keys()]

# Now concatenate all of the monthly summaries into separate DataFrames.
# The index is kept so that LopNr, year and month survive for the regrouping below.
dfs = {name: pd.concat(monthly_summaries[name])
       for name in items.keys()}

# Now regroup the aggregate monthly summaries.
monthly_summaries = {name: dfs[name].reset_index().groupby(['LopNr','year','month']).sum()
                    for name in items.keys()}

# Finally, save the aggregated results to files.
[monthly_summaries[name].to_csv('PATH/monthly_{0}_costs.csv'.format(name))
 for name in items.keys()]

You should prefer an explicit for loop:

for name in items.keys():
    monthly_summaries[name].append(dfs[name].groupby(['LopNr','year','month']).sum()
                                            .astype(int, copy=False))

# rather than
[monthly_summaries[name].append(dfs[name].groupby(['LopNr','year','month']).sum()
                                         .astype(int, copy=False))
 for name in items.keys()]

The latter builds a throwaway list of None values (one per append call) and is harder to read, so it is less efficient.
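As a small illustration of that point (the dictionary contents here are made up), a comprehension used only for its side effect still collects the return value of every append call:

```python
summaries = {"neuro": [], "cardio": []}

# Comprehension used only for its side effect: list.append returns None,
# so a list of None values is built and then thrown away.
discarded = [summaries[name].append(1) for name in summaries]

# Explicit loop: same side effect, no throwaway list.
for name in summaries:
    summaries[name].append(2)
```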

The explicit loop also nests easily.
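A minimal sketch of such nesting, assuming an outer loop over the ATC prefixes and an inner loop over the measures to derive. The toy data, the `measures` mapping and the `name_suffix` naming scheme are assumptions for illustration, not part of the original code:

```python
import pandas as pd

# Toy stand-in for one monthly file (values are made up for illustration).
monthly = pd.DataFrame({
    "LopNr": [1, 1, 2, 2],
    "ATC": ["N02BE01", "C07AB02", "N05BA01", "L01XE01"],
    "TKOST": [10.0, 20.0, 5.0, 100.0],
})

items = {"neuro": "N", "cardio": "C", "cancer": "L"}

# Hypothetical inner-loop definition: suffix -> function that turns the
# boolean indicator into the column to be summed.
measures = {
    "count": lambda ind: ind.astype(int),
    "cost": lambda ind: monthly["TKOST"].where(ind, 0.0),
}

columns = {}
for name, code in items.items():
    indicator = monthly["ATC"].str.startswith(code)
    for suffix, build in measures.items():
        # The inner loop also determines the variable name, e.g. neuro_cost.
        columns["{0}_{1}".format(name, suffix)] = build(indicator)

# One dataframe, one aggregation.
result = pd.DataFrame(columns).groupby(monthly["LopNr"]).sum()
```

Adding another variable would then only mean adding one more entry to `measures`.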


"It would make sense for this loop to affect the variable names too, e.g. to have a plain neuro variable for the count (or neuro_count), a neuro_cost for the sum of TKOST over the neuro cases, and so on. How is this possible?"

I usually add columns to the dataframe to do these counts; that way the work can be vectorized, split, or otherwise processed with ordinary pandas operations.
(Then simply don't write these helper columns out to the CSV.)
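One possible reading of this suggestion, sketched on made-up data: the column names `is_neuro`, `neuro_count` and `neuro_cost` are assumptions, and an in-memory buffer stands in for the real output file.

```python
import io

import pandas as pd

# Toy data (made up for illustration).
df = pd.DataFrame({
    "LopNr": [1, 1, 2],
    "ATC": ["N02BE01", "C07AB02", "N05BA01"],
    "TKOST": [10.0, 20.0, 5.0],
})

# Helper indicator column; it is only an intermediate step.
df["is_neuro"] = df["ATC"].str.startswith("N")

# Derived columns, computed for all rows at once.
df["neuro_count"] = df["is_neuro"].astype(int)
df["neuro_cost"] = df["TKOST"] * df["neuro_count"]

# Aggregate only the columns meant for export, so the helper is left out.
summary = df.groupby("LopNr")[["neuro_count", "neuro_cost"]].sum()

buffer = io.StringIO()  # stand-in for a file path
summary.to_csv(buffer)
csv_text = buffer.getvalue()
```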
