简体   繁体   中英

Aggregating Over Actual Year in SAS

Lets suppose we have the following table ("Purchases"):

Date                 Units_Sold             Brand       Year
18/03/2010                5                   A         2010
12/04/2010                2                   A         2010
22/05/2010                1                   A         2010
25/05/2010                7                   A         2010
11/08/2011                5                   A         2011
12/07/2010                2                   B         2010
22/10/2010                1                   B         2010
05/05/2011                7                   B         2011

And the same logic continues until the end of 2014, for different brands.

What I want to do is calculate the number of Units_Sold for every Brand, in each year. However, I don't want to do it for the calendar year, but for the actual year.

So an example of what I don't want:

proc sql;
create table Dont_Want as
select Year, Brand, sum(Units_Sold) as Unit_per_Year
from Purchases
group by Year, Brand;
quit;

The above logic is ok if we know that eg Brand "A" exists throughout the whole 2010. But if Brand "A" appeared on 18/03/2010 for the first time, and exists until now, then a comparison of Years 2010 and 2011 would not be good enough as for 2010 we are "lacking" 3 months.

So what I want to do is calculate:

for A: the sum from 18/03/2010 until 17/03/2011, then from 18/03/2011 until 17/03/2012, etc.

for B: the sum from 12/07/2010 until 11/07/2011, etc.

and so on for all Brands.

Is there a smart way of doing this?

Step 1: Make sure your dataset is sorted or indexed by Brand and Date

proc sort data=want;
     by brand date;
run;

Step 2: Calculate the start/end dates for each product

The idea behind the below code:

  1. We know that the first occurrence of the brand in the sorted dataset is the day in which the brand was introduced. We'll call this Product_Year_Start .

  2. The intnx function can be used to increment that date by 365 days, then subtract 1 from it. Let's call this date Product_Year_End .

  3. Since we now know the product's year end date, we know that if the date on any given row exceeds the product's year end date, we have started the next product year. We'll just take the calculated Product_Year_End and Product_Year_Start for that brand and bump them up by one year.

This is all achieved using by-group processing and the retain statement.

data Comparison_Dates;
    set have;
    by brand date;

    retain Product_Year_Start Product_Year_End;

    if(first.brand) then do;
        Product_Year_Start = date;
        Product_Year_End = intnx('year', date, 1, 'S') - 1;
    end;

    if(Date > Product_Year_End) then do;
        Product_Year_Start = intnx('year', Product_Year_Start, 1, 'S');
        Product_Year_End = intnx('year', Product_Year_End, 1, 'S');
    end;

    format Product_Year_Start Product_Year_End date9.;
run;

Step 3: Using the original SQL code, group instead by the new product start/end dates

proc sql;
    create table want as
    select catt(year(Product_Year_Start), '-', year(Product_Year_End) ) as Product_Year
         , Brand
         , sum(Units_Sold) as Unit_per_Year
    from Comparison_Dates
    group by Brand, calculated Product_Year
    order by Brand, calculated Product_Year;
quit;

The following code is doing what you ask in a literal sense, for the earliest 'date' of each 'brand', it start aggregating 'unitssold', when hits 365 days mark, it resets count, and starts another cycle.

data have;
    informat date ddmmyy10.;
    input date units_sold brand $ year;
    format date date9.;
    cards;
18/03/2010                5                   A         2010
12/04/2010                2                   A         2010
22/05/2010                1                   A         2010
25/05/2010                7                   A         2010
11/08/2011                5                   A         2011
12/07/2010                2                   B         2010
22/10/2010                1                   B         2010
05/05/2011                7                   B         2011
;

proc sort data=have;
    by brand date;
run;

data want;
    do until (last.brand);
        set have;
        by brand date;

        if first.brand then
            do;
                Sales_Over_365=0;
                _end=intnx('day',date,365);
            end;

        if date <= _end then
            Sales_Over_365+units_sold;
        else
            do;
                output;
                Sales_Over_365=units_sold;
                _end=intnx('day',date,365);
            end;
    end;

    output;
    drop _end;
run;

You need to have a start date for each brand. For now we can use the first sale date, but that might not be what you want. Then you can classify each sales date into which year it is for that brand.

Let's start by creating a dataset from your sample data. The YEAR variable is not needed.

data have ;
  input Date Units_Sold Brand $ Year ;
  informat date ddmmyy10.;
  format date yymmdd10.;
cards;
18/03/2010 5 A 2010
12/04/2010 2 A 2010
22/05/2010 1 A 2010
25/05/2010 7 A 2010
11/08/2011 5 A 2011
12/07/2010 2 B 2010
22/10/2010 1 B 2010
05/05/2011 7 B 2011
;;;;

Now we can get the answer you want with an SQL query.

proc sql ;
  create table want as
   select brand
        , start_date
        , 1+floor((date - start_date)/365) as sales_year
        , intnx('year',start_date,calculated sales_year -1,'same')
            as start_sales_year format=yymmdd10.
        , sum(units_sold) as total_units_sold
  from
  ( select brand
        , min(date) as start_date format=yymmdd10.
        , date
        , units_sold
    from have
    group by 1
   )
  group by 1,2,3,4
  ;
quit;

This will produce this result:

                                               total_
                       sales_      start_      units_
Brand    start_date     year     sales_year     sold
  A      2010-03-18       1      2010-03-18      15
  A      2010-03-18       2      2011-03-18       5
  B      2010-07-12       1      2010-07-12      10

There is no straight forward way of doing it. You can do something like this.

To test the code, I saved your table in to a text file.

Then I created a class called Sale.

public class Sale
{
    public DateTime Date { get; set; }
    public int UnitsSold { get; set; }
    public string Brand { get; set; }
    public int Year { get; set; }
}

Then I populated a List<Sale> using the saved text file.

var lines = File.ReadAllLines(@"C:\Users\kosala\Documents\data.text");
var validLines = lines.Where(l => !l.Contains("Date")).ToList();//remove the first line.

List<Sale> sales = validLines.Select(l => new Sale()
        {
            Date = DateTime.Parse(l.Substring(0,10)),
            UnitsSold = int.Parse(l.Substring(26,5)),
            Brand = l.Substring(46,1),
            Year = int.Parse(l.Substring(56,4)),
        }).ToList();

//All the above code is for testing purposes. The actual code starts from here.
var totalUnitsSold = sales.OrderBy(s => s.Date).GroupBy(s => s.Brand);

        foreach (var soldUnit in totalUnitsSold)
        {
            DateTime? minDate = null;
            DateTime? maxDate = null;
            int total = 0;
            string brand = "";

            foreach (var sale in soldUnit)
            {
                brand = sale.Brand;
                if (minDate == null)
                {
                    minDate = sale.Date;
                }
                if ((sale.Date - minDate).Value.Days <= 365)
                {
                    maxDate = sale.Date;
                    total += sale.UnitsSold;
                }
                else
                {
                    break;
                }
            }
            Console.WriteLine("Brand : {0} UnitsSold Between {1} - {2} is {3}",brand, minDate.Value, maxDate.Value, total);
   }

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM