
How to subset data using local macros

I have some code that uses local macros in Stata to subset an aggregate data set into quantiles by asset size. That code looks like this:

local quantile 0 25 50 75 99
foreach quantile in `quantile' {
    preserve

    //Template for the top quantile in set
    if `quantile' == 99 {
        egen bottomcutoff = pctile(assets), p(`quantile')
        keep if assets > bottomcutoff
    }

    //Template for the bottom quantile in set
    else if `quantile' == 0 {
        local quantile10=`quantile'+10
        egen topcutoff = pctile(assets), p(`quantile10')
        keep if assets <= topcutoff
    }

    //Template for middle quantiles of distance 24
    else if `quantile' == 75 {
        egen bottomcutoff = pctile(assets), p(`quantile')
        local quantile10=`quantile'+24
        egen topcutoff = pctile(assets), p(`quantile10')
        keep if assets > bottomcutoff & assets <= topcutoff
    }

    //Template for middle quantiles of distance 25
    else {
        egen bottomcutoff = pctile(assets), p(`quantile')
        local quantile10=`quantile'+25
        egen topcutoff = pctile(assets), p(`quantile10')
        keep if assets > bottomcutoff & assets <= topcutoff
    }
}

I'm trying to retrofit this code to subset by asset-size thresholds instead of by percentile, but I can't figure out how to get it to work properly with this local macro method. The thresholds I need are less than 10 million, between 10 and 100 million (I labeled that one 55, the midpoint, because I don't know a better way to refer to it), and greater than 100 million. Here's what I've tried so far:

local subsets 10 55 100
foreach subset in `subsets' {
    preserve

    //Template for the subset greater than 100000000
    if `subset' == 100 {
        gen subset_obs = (assets >= 100000000)
        bysort company_id : egen subset_id = max(subset_obs)
        keep if subset_id == 1
    }

    //Template for less than or equal to 10000000
    else if `subset' == 10 {
        gen subset_obs = (assets <= 10000000)
        bysort company_id : egen subset_id = max(subset_obs)
        keep if subset_id == 1
    }

    //Template for between 10000000 and 100000000
    else if `subset' == 55 {
        gen subset_obs = (assets > 10000000 & assets < 100000000)
        bysort company_id : egen subset_id = max(subset_obs)
        keep if subset_id == 1
    }
}

This isn't a complete answer but it would be much harder to read as a comment.

Consider the results of

gen which = cond(assets <= 1e7, 1, cond(assets <= 1e8, 2, 3)) 

bysort company_id (which) : replace which = which[_N] 

That's a simpler way of selecting whether something was ever true than using egen's max() function. Note that writing the thresholds as 1e7 and 1e8 spares the reader from counting zeros.

This code classifies companies by the highest assets band they ever reached: not higher than 10 million (unstated currency units), not higher than 100 million, or higher still.
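As a minimal sketch of how that classification could drive the subsetting, here it is applied to a small panel. The six observations are invented purely for illustration, and summarize is only a placeholder for whatever analysis each subset really needs.

```stata
// Invented toy panel: three companies, two observations each
clear
input long company_id double assets
1 5000000
1 8000000
2 5000000
2 50000000
3 200000000
3 9000000
end

// Band of each observation, then the highest band the company ever reached
gen which = cond(assets <= 1e7, 1, cond(assets <= 1e8, 2, 3))
bysort company_id (which) : replace which = which[_N]

list, sepby(company_id)

// One pass per band; restore brings the full data back before the next pass
forvalues band = 1/3 {
    preserve
    keep if which == `band'
    display "Band `band'"
    summarize assets    // placeholder for the real analysis
    restore
}
```

If the work for each band is a single command, an if qualifier such as summarize assets if which == 2 may remove the need for preserve and restore altogether.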

A consequence of your code is that the classification is not guaranteed disjoint. In principle a company that straddles two or three bands would be selected for two or three subsets. Perhaps that is what you want.
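To see the overlap concretely, here is a small sketch on invented data that mimics the egen, max() logic from the question for two of the bands. The variable names low_obs, mid_obs, ever_low and ever_mid are mine, not from the question.

```stata
// Invented toy panel: company 1 moves from the lowest band to the middle band
clear
input long company_id double assets
1 5000000
1 50000000
2 200000000
end

// Same flags as in the question, one pair per band
gen byte low_obs = assets <= 1e7
gen byte mid_obs = assets > 1e7 & assets < 1e8
bysort company_id : egen ever_low = max(low_obs)
bysort company_id : egen ever_mid = max(mid_obs)

list, sepby(company_id)

// Observations that would be kept in both the low and the middle subset
count if ever_low & ever_mid
```

Both observations of company 1 are flagged for the low subset and for the middle subset, so that company would appear in two of the three analyses.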

The code above is optimistic: it assumes that assets is never missing. If it can be missing, the code needs to be more careful, because Stata treats missing as arbitrarily large and so larger than 1e8.
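One possible way to be more careful, sketched on invented data: leave the band missing wherever assets is missing, then take the company maximum with egen, whose max() function ignores missing values. The intermediate variable band is a name I made up for this sketch, and whether companies with no usable assets should be dropped or kept is an assumption you would need to settle yourself.

```stata
// Invented toy panel with missing assets
clear
input long company_id double assets
1 5000000
1 .
2 .
2 .
end

// Band is left missing when assets is missing, rather than falling into band 3
gen band = cond(assets <= 1e7, 1, cond(assets <= 1e8, 2, 3)) if !missing(assets)

// egen's max() ignores missing values; a company with no data stays missing
bysort company_id : egen which = max(band)

list, sepby(company_id)
```

Here company 1 is classified in band 1 despite its missing observation, and company 2 has no classification at all.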
