
Aggregating Over Actual Year in SAS

Let's suppose we have the following table ("Purchases"):

Date                 Units_Sold             Brand       Year
18/03/2010                5                   A         2010
12/04/2010                2                   A         2010
22/05/2010                1                   A         2010
25/05/2010                7                   A         2010
11/08/2011                5                   A         2011
12/07/2010                2                   B         2010
22/10/2010                1                   B         2010
05/05/2011                7                   B         2011

The same pattern continues until the end of 2014, for different brands.

What I want to do is calculate the number of Units_Sold for every Brand in each year. However, I don't want to do it per calendar year, but per actual year, counted from the brand's first sale.

So an example of what I don't want:

proc sql;
create table Dont_Want as
select Year, Brand, sum(Units_Sold) as Unit_per_Year
from Purchases
group by Year, Brand;
quit;

The above logic is fine if we know that, e.g., Brand "A" existed throughout the whole of 2010. But if Brand "A" appeared for the first time on 18/03/2010 and has existed ever since, then comparing the years 2010 and 2011 is not good enough, because for 2010 we are "lacking" almost 3 months.

So what I want to do is calculate:

for A: the sum from 18/03/2010 until 17/03/2011, then from 18/03/2011 until 17/03/2012, etc.

for B: the sum from 12/07/2010 until 11/07/2011, etc.

and so on for all Brands.

Is there a smart way of doing this?

Step 1: Make sure your dataset is sorted or indexed by Brand and Date

proc sort data=have;
     by brand date;
run;

Step 2: Calculate the start/end dates for each product

The idea behind the code below:

  1. We know that the first occurrence of the brand in the sorted dataset is the day on which the brand was introduced. We'll call this Product_Year_Start.

  2. The intnx function can be used to increment that date by one year (keeping the same day and month); subtracting 1 from the result gives the last day of the product year. Let's call this date Product_Year_End.

  3. Since we now know the product's year end date, we know that if the date on any given row exceeds it, we have started the next product year. We'll just take the calculated Product_Year_End and Product_Year_Start for that brand and bump them up by one year.

This is all achieved using by-group processing and the retain statement.

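data Comparison_Dates;
    set have;
    by brand date;

    /* Keep the current product-year window across rows */
    retain Product_Year_Start Product_Year_End;

    /* First row of a brand: the window opens on the first sale date */
    if first.brand then do;
        Product_Year_Start = date;
        Product_Year_End = intnx('year', date, 1, 'S') - 1;
    end;

    /* Advance the window until it contains the current date.
       A loop (rather than a single IF) also covers gaps longer than one year. */
    do while (not missing(Product_Year_End) and date > Product_Year_End);
        Product_Year_Start = intnx('year', Product_Year_Start, 1, 'S');
        Product_Year_End = intnx('year', Product_Year_End, 1, 'S');
    end;

    format Product_Year_Start Product_Year_End date9.;
run;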
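
To make the date arithmetic in Step 2 concrete, here is a minimal sketch of how intnx behaves with and without the 'S' (same) alignment. The variable names are made up for illustration; the date is brand A's first sale from the question.

```sas
data _null_;
    first_sale = '18MAR2010'd;

    /* 'S' alignment keeps the day and month and only moves the year */
    next_start = intnx('year', first_sale, 1, 'S');

    /* The product year ends the day before the next one starts */
    year_end = next_start - 1;

    /* Without an alignment argument, intnx snaps to the beginning
       of the interval, which is 1 January for 'year' */
    default_al = intnx('year', first_sale, 1);

    put first_sale= date9. next_start= date9. year_end= date9. default_al= date9.;
run;
```

The log line should read first_sale=18MAR2010 next_start=18MAR2011 year_end=17MAR2011 default_al=01JAN2011, which is why the 'S' argument matters here: the default alignment would fall back to calendar years.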

Step 3: Using the original SQL code, group by the new product start/end dates instead

proc sql;
    create table want as
    select catt(year(Product_Year_Start), '-', year(Product_Year_End) ) as Product_Year
         , Brand
         , sum(Units_Sold) as Unit_per_Year
    from Comparison_Dates
    group by Brand, calculated Product_Year
    order by Brand, calculated Product_Year;
quit;

The following code does what you ask in a literal sense: starting from the earliest date of each brand, it aggregates units_sold; when a sale falls past the 365-day mark, it outputs the running total, resets the count, and starts another cycle from that sale's date.

data have;
    informat date ddmmyy10.;
    input date units_sold brand $ year;
    format date date9.;
    cards;
18/03/2010                5                   A         2010
12/04/2010                2                   A         2010
22/05/2010                1                   A         2010
25/05/2010                7                   A         2010
11/08/2011                5                   A         2011
12/07/2010                2                   B         2010
22/10/2010                1                   B         2010
05/05/2011                7                   B         2011
;

proc sort data=have;
    by brand date;
run;

data want;
    do until (last.brand);
        set have;
        by brand date;

        if first.brand then
            do;
                Sales_Over_365=0;
                _end=intnx('day',date,365);
            end;

        if date <= _end then
            Sales_Over_365+units_sold;
        else
            do;
                output;
                Sales_Over_365=units_sold;
                _end=intnx('day',date,365);
            end;
    end;

    output;
    drop _end;
run;

You need to have a start date for each brand. For now we can use the first sale date, but that might not be what you want. Then you can classify each sales date by which year it falls in for that brand.

Let's start by creating a dataset from your sample data. The YEAR variable is not needed.

data have ;
  input Date Units_Sold Brand $ Year ;
  informat date ddmmyy10.;
  format date yymmdd10.;
cards;
18/03/2010 5 A 2010
12/04/2010 2 A 2010
22/05/2010 1 A 2010
25/05/2010 7 A 2010
11/08/2011 5 A 2011
12/07/2010 2 B 2010
22/10/2010 1 B 2010
05/05/2011 7 B 2011
;;;;

Now we can get the answer you want with an SQL query.

proc sql ;
  create table want as
   select brand
        , start_date
        , 1+floor((date - start_date)/365) as sales_year
        , intnx('year',start_date,calculated sales_year -1,'same')
            as start_sales_year format=yymmdd10.
        , sum(units_sold) as total_units_sold
  from
  ( select brand
        , min(date) as start_date format=yymmdd10.
        , date
        , units_sold
    from have
    group by 1
   )
  group by 1,2,3,4
  ;
quit;

This produces the following result:

                                               total_
                       sales_      start_      units_
Brand    start_date     year     sales_year     sold
  A      2010-03-18       1      2010-03-18      15
  A      2010-03-18       2      2011-03-18       5
  B      2010-07-12       1      2010-07-12      10
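
One possible refinement, which is not part of the answer above: dividing by 365 drifts by a day whenever a leap day falls inside the window. A sketch of an alternative is to count elapsed anniversaries with intck and its 'continuous' method, assuming a SAS release in which intck accepts that fourth argument. The table name want_intck is hypothetical.

```sas
/* Sample data from the question */
data have;
  input Date :ddmmyy10. Units_Sold Brand $ Year;
  format Date yymmdd10.;
cards;
18/03/2010 5 A 2010
12/04/2010 2 A 2010
22/05/2010 1 A 2010
25/05/2010 7 A 2010
11/08/2011 5 A 2011
12/07/2010 2 B 2010
22/10/2010 1 B 2010
05/05/2011 7 B 2011
;

proc sql;
  create table want_intck as
  select brand
       , start_date format=yymmdd10.
         /* 'continuous' counts whole years elapsed since start_date,
            so the boundary falls on the anniversary of the first sale */
       , 1 + intck('year', start_date, date, 'continuous') as sales_year
       , sum(units_sold) as total_units_sold
  from
  ( /* SAS remerges min(date) back onto every row of the brand */
    select brand
         , min(date) as start_date
         , date
         , units_sold
    from have
    group by brand
  )
  group by 1, 2, 3
  ;
quit;
```

For the sample data this should give the same three rows as above (A year 1 with 15 units, A year 2 with 5, B year 1 with 10); the two approaches only differ for dates that land within a day or so of an anniversary.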

There is no straightforward way of doing it. You can do something like this.

To test the code, I saved your table to a text file.

Then I created a class called Sale.

public class Sale
{
    public DateTime Date { get; set; }
    public int UnitsSold { get; set; }
    public string Brand { get; set; }
    public int Year { get; set; }
}

Then I populated a List<Sale> using the saved text file.

var lines = File.ReadAllLines(@"C:\Users\kosala\Documents\data.text");
var validLines = lines.Where(l => !l.Contains("Date")).ToList();//remove the first line.

List<Sale> sales = validLines.Select(l => new Sale()
        {
            Date = DateTime.Parse(l.Substring(0,10)),
            UnitsSold = int.Parse(l.Substring(26,5)),
            Brand = l.Substring(46,1),
            Year = int.Parse(l.Substring(56,4)),
        }).ToList();

// All the above code is for testing purposes. The actual code starts from here.
var totalUnitsSold = sales.OrderBy(s => s.Date).GroupBy(s => s.Brand);

foreach (var soldUnit in totalUnitsSold)
{
    DateTime? minDate = null;
    DateTime? maxDate = null;
    int total = 0;
    string brand = "";

    foreach (var sale in soldUnit)
    {
        brand = sale.Brand;
        if (minDate == null)
        {
            minDate = sale.Date;
        }
        if ((sale.Date - minDate).Value.Days <= 365)
        {
            maxDate = sale.Date;
            total += sale.UnitsSold;
        }
        else
        {
            break;
        }
    }
    Console.WriteLine("Brand : {0} UnitsSold Between {1} - {2} is {3}", brand, minDate.Value, maxDate.Value, total);
}
