I am interested in dividing my data into thirds, but I only have a summary table of counts by a state. Specifically, I have estimated enrollment counts by state, and I would like to calculate what states comprise the top third of all enrollments. So, the top third should include at least a total cumulative percentage of .33333...
I have tried various means of specifying cumulative percentages between .33333 and .40000 but with no success in specifying the general case. PROC RANK
also can't be used because the data is organized as a frequency table...
I have included some dummy (but representative) data below.
data state_counts;
input state $20. enrollment;
cards;
CALIFORNIA 440233
TEXAS 318921
NEW YORK 224867
FLORIDA 181517
ILLINOIS 162664
PENNSYLVANIA 155958
OHIO 141083
MICHIGAN 124051
NEW JERSEY 117131
GEORGIA 104351
NORTH CAROLINA 102466
VIRGINIA 93154
MASSACHUSETTS 80688
INDIANA 75784
WASHINGTON 73764
MISSOURI 73083
MARYLAND 73029
WISCONSIN 72443
TENNESSEE 71702
ARIZONA 69662
MINNESOTA 66470
COLORADO 58274
ALABAMA 54453
LOUISIANA 50344
KENTUCKY 49595
CONNECTICUT 47113
SOUTH CAROLINA 46155
OKLAHOMA 43428
OREGON 42039
IOWA 38229
UTAH 36476
KANSAS 36469
MISSISSIPPI 33085
ARKANSAS 32533
NEVADA 27545
NEBRASKA 24571
NEW MEXICO 22485
WEST VIRGINIA 21149
IDAHO 20596
NEW HAMPSHIRE 19121
MAINE 18213
HAWAII 16304
RHODE ISLAND 13802
DELAWARE 12025
MONTANA 11661
SOUTH DAKOTA 11111
VERMONT 10082
ALASKA 9770
NORTH DAKOTA 9614
WYOMING 7457
DIST OF COLUMBIA 6487
;
run;
***** calculating the cumulative frequencies by hand ;
proc sql;
create table dummy_3 as
select
state,
enrollment,
sum(enrollment) as total_enroll,
enrollment / calculated total_enroll as percent_total
from state_counts
order by percent_total desc ;
quit;
data dummy_4; set dummy_3;
if first.percent_total then cum_percent = 0;
cum_percent + percent_total;
run;
Based on the value for cum_percent
, the states that make up the top third of enrollment are: California, Texas, New York, Florida, and Illinois.
Is there any way to do this programatically? I'd eventually like to specify a flag variable for selecting states.
Thanks...
It seems like you're 90% of the way there. If you just need a way to put cum_percent into flagged buckets, setting up a format is pretty straightforward.
proc format;
value pctile
low-0.33333 = 'top third'
0.33333<-.4 = 'next bit'
0.4<-high = 'the rest'
;
run;
options fmtsearch=(work);
And add a statement at the end of your datastep:
pctile_flag = put(cum_percent,pctile.);
Rewrite your last data step like this:
data dummy_4(drop=found);
set dummy_3;
retain cum_percent 0 found 0;
cum_percent + percent_total;
if cum_percent < (1/3) then do;
top_third = 1;
end;
else if ^found then do;
top_third = 1;
found =1;
end;
else
top_third = 0;
run;
note: your first.
syntax is incorrect. first.
and last
. only work on BY groups. You get the right values in CUM_PERCENT by way of the cum_percent + percent_total;
statement.
I am not aware of a PROC that will do this for you.
You can easily count percentages using PROC FREQ with WEIGHT statement and then select those in the first third using LAG function:
proc freq data=state_counts noprint order=data;
tables state / out=state_counts2;
weight enrollment;
run;
data top3rd;
set state_counts2;
cum_percent+percent;
if lag(cum_percent)<100/3 then top_third=1;
run;
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.