
How to model a fact table

I'm about to create a data warehouse with facts and dimensions in a star-schema.

The business questions I want to answer are typically these:

  • How much money did we sell for in Q1?
  • How much money did we sell for in Q1 to females?
  • How much money did we sell for in Q1 to females between age 30-35?
  • How much money did we sell for in Q1 to females between age 30-35 living in new york?

  • How much money did we sell for in category clothes last year?

  • How much money did we sell for of the product blue jeans last year?
  • How much money did we sell for of the product blue jeans to males between 40-42 living in Australia last year?

I am thinking of a date dimension with the granularity of an hour (specifying year, month, day, hour, quarter, name of day, name of month, etc.). I am also thinking of a product dimension and a user dimension.

I wonder if these questions could be answered using a single fact table, or whether it's proper to create multiple fact tables. I am thinking of a table such as:

FactSales

DimDate - FK to a table containing information about the date (such as quarter, day of week, year, month, day)

DimProduct - FK to a table containing information about the product (such as product name)

DimUser - FK to a table containing information about the user (such as age, gender)

TotalSales - a SUM of all sales for that particular date, product and user.

Also, what if I would like to measure both the total sales (money) and the total number of sales? Would it be proper to create a new fact table with the same dimensions but using TotalNumberOfSales as the fact instead?

Thankful for all input I can get about this.

I think you are on the right track. All of the questions above should be possible to answer using a single fact table covering the sales.
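As a minimal sketch of that claim, the following builds a tiny star schema in an in-memory SQLite database and answers "how much did we sell for in Q1 to females between age 30-35" from one fact table. The column names (DateKey, ProductKey, UserKey, SalesAmount, City, Category) and the sample rows are assumptions made for illustration, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimDate    (DateKey INTEGER PRIMARY KEY, Year INTEGER, Quarter INTEGER);
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, ProductName TEXT, Category TEXT);
CREATE TABLE DimUser    (UserKey INTEGER PRIMARY KEY, Gender TEXT, Age INTEGER, City TEXT);
CREATE TABLE FactSales (
    DateKey     INTEGER REFERENCES DimDate(DateKey),
    ProductKey  INTEGER REFERENCES DimProduct(ProductKey),
    UserKey     INTEGER REFERENCES DimUser(UserKey),
    SalesAmount REAL
);
-- Hypothetical sample data
INSERT INTO DimDate VALUES (20240115, 2024, 1), (20240610, 2024, 2);
INSERT INTO DimProduct VALUES (1, 'Blue jeans', 'Clothes'), (2, 'Mug', 'Kitchen');
INSERT INTO DimUser VALUES (1, 'F', 32, 'New York'), (2, 'M', 41, 'Sydney');
INSERT INTO FactSales VALUES
    (20240115, 1, 1, 50.0),
    (20240115, 2, 1, 10.0),
    (20240115, 1, 2, 55.0),
    (20240610, 1, 1, 45.0);
""")

# Filter on dimension attributes, aggregate the fact rows.
query = """
SELECT SUM(f.SalesAmount), COUNT(*)
FROM FactSales f
JOIN DimDate d ON d.DateKey = f.DateKey
JOIN DimUser u ON u.UserKey = f.UserKey
WHERE d.Year = 2024 AND d.Quarter = 1
  AND u.Gender = 'F' AND u.Age BETWEEN 30 AND 35
"""
total_sales, number_of_rows = conn.execute(query).fetchone()
print(total_sales, number_of_rows)
```

The same query returns both the money total and a row count, so a second fact table just for the number of sales does not seem necessary here. Adding further filters (city, product, category) only means joining more dimensions.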

I think one should start out unaggregated and aggregate later if needed. Considering that one sale can contain multiple products and multiple items, I'd organize it as follows: one fact row for each product in the sale (typically the lines on the invoice, so I'd call them "order lines" or "sale lines"), and maybe three counter attributes:

  • NumItems - number of items, i.e. 3 if the customer bought three of the same product.
  • NumLines - number of "order lines"; should always be 1. May be useful when aggregating data later (it is a big win to already have sum(NumLines) rather than count(*) in the SQL), or when adding correction items ( NumLines = -1 ).
  • NumSales - a fractional number that can be summed up to yield the number of sales (i.e. 0.333 if the sale involves three different products and hence contains three order lines).
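The three counters can be sketched as follows, using an in-memory SQLite table. The table name FactSaleLines and the sample sales are hypothetical; the point is only that each counter can be summed directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE FactSaleLines (
    SaleID   INTEGER,
    Product  TEXT,
    NumItems INTEGER,
    NumLines INTEGER,
    NumSales REAL
)
""")

# Hypothetical data: sale 1 has three order lines, sale 2 has one.
sales = {
    1: [("jeans", 3), ("shirt", 1), ("socks", 2)],
    2: [("jeans", 1)],
}
for sale_id, lines in sales.items():
    for product, items in lines:
        # NumLines is always 1; NumSales is this line's share of the sale.
        conn.execute(
            "INSERT INTO FactSaleLines VALUES (?, ?, ?, 1, ?)",
            (sale_id, product, items, 1.0 / len(lines)),
        )

num_items, num_lines, num_sales = conn.execute(
    "SELECT SUM(NumItems), SUM(NumLines), SUM(NumSales) FROM FactSaleLines"
).fetchone()
print(num_items, num_lines, round(num_sales, 6))
```

Because NumSales is stored as a floating-point fraction, the sum is only approximately a whole number, so rounding the result is a reasonable precaution.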

Now, getting the right count becomes a problem for questions such as "number of sales involving black clothes", because summing NumSales over only some of a sale's lines no longer yields a whole number of sales. We had this problem at my previous workplace. I'm sure there must be some "best practice" for this, but we more or less ended up introducing a SaleID (or TransactionID ) in the fact table and doing count(distinct SaleID) . That lacks elegance, but it works.
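To illustrate the counting problem, here is a small sketch comparing sum(NumSales) with count(distinct SaleID) for "sales involving black items". For brevity the colour is stored directly on the fact row, which is a simplification assumed for this example; in a real star schema it would live in the product dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE FactSaleLines (SaleID INTEGER, Product TEXT, Colour TEXT, NumSales REAL)"
)
# Hypothetical data: sale 1 mixes black and white items.
rows = [
    (1, "jeans", "black", 1 / 3),
    (1, "shirt", "black", 1 / 3),
    (1, "socks", "white", 1 / 3),
    (2, "shirt", "white", 1.0),
    (3, "hat", "black", 1.0),
]
conn.executemany("INSERT INTO FactSaleLines VALUES (?, ?, ?, ?)", rows)

fractional, distinct_sales = conn.execute(
    """
    SELECT SUM(NumSales), COUNT(DISTINCT SaleID)
    FROM FactSaleLines
    WHERE Colour = 'black'
    """
).fetchone()
print(fractional, distinct_sales)
```

Sales 1 and 3 both involve black items, so the distinct count gives 2, while the fractional sum gives roughly 1.67 because only two of the three lines of sale 1 pass the filter.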

In our setup we had several money attributes. The most important were one for the revenue (what's left of the income after paying the direct costs attributable to the items sold) and one for the turnover (the price paid by the customer for the item). Sales tax or VAT may add more complications. One can make do with only one money attribute and then split each sale into multiple lines in the fact table, but I would rather recommend multiple money columns in the sales line fact table. Everything in the fact table was counted in a "base currency" (Euros, in our case), and we had an exchange rate dimension to track the exact amounts.

I don't think it makes sense to have a date dimension containing the hour of the day. At my former job I kept my warehouse in Postgres, and I actually managed quite well without a date dimension at all. Although a date dimension is considered "best business practice", I found that for all our purposes we got much better performance by using the standard Postgres date functions instead of dragging in a date dimension. I experimented quite a lot with this, and in the end I found the best option was to split date and time into two different attributes. (Time zones and daylight saving time gave me quite a few extra headaches...)
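A rough sketch of that approach: date and time are stored as two separate attributes on the fact row, and the quarter is derived with built-in date functions instead of a join to a date dimension. SQLite is used here only as a self-contained stand-in for Postgres (where `EXTRACT(QUARTER FROM ...)` would do the same job); the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FactSales (SaleDate TEXT, SaleTime TEXT, Amount REAL)")
conn.executemany(
    "INSERT INTO FactSales VALUES (?, ?, ?)",
    [
        ("2024-01-15", "09:30:00", 50.0),
        ("2024-03-31", "23:10:00", 20.0),
        ("2024-04-01", "00:05:00", 70.0),
    ],
)

# Derive year and quarter from the date column; (month + 2) / 3 uses
# integer division, mapping months 1-3 to 1, 4-6 to 2, and so on.
rows = conn.execute(
    """
    SELECT CAST(strftime('%Y', SaleDate) AS INTEGER) AS year,
           (CAST(strftime('%m', SaleDate) AS INTEGER) + 2) / 3 AS quarter,
           SUM(Amount)
    FROM FactSales
    GROUP BY year, quarter
    ORDER BY year, quarter
    """
).fetchall()
print(rows)
```

This says nothing about performance, which depends on the database and the data volume; it only shows that quarter-level questions do not strictly require a date dimension.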

I agree with tobixen - you're on the right track.

I would recommend that you read Ralph Kimball's book "The Data Warehouse Toolkit", particularly the chapter on retail sales - it goes in depth about a sales fact.

The Date Dimension is like having a Calendar table - you can split based on quarters, fiscal months, and other things that are business specific to dates. I typically keep both the date key as well as a timestamp datatype, so we can do things with the Fiscal Calendar. I would actually have a separate Time dimension if you need to have your grain of the table at that level, with buckets for hours of the day, or minutes, etc. I doubt you need hourly though.
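As an illustration of the calendar-table idea, here is a small sketch that generates date dimension rows in Python. The column names and the YYYYMMDD integer key are assumptions; fiscal-calendar columns are business specific and would be added alongside the ones shown.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]


def build_date_dimension(start, end):
    """Return one dict per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": current.year * 10000 + current.month * 100 + current.day,
            "full_date": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "month_name": MONTH_NAMES[current.month - 1],
            "day": current.day,
            "day_name": DAY_NAMES[current.weekday()],  # weekday(): Monday is 0
        })
        current += timedelta(days=1)
    return rows


dim_date = build_date_dimension(date(2024, 1, 1), date(2024, 12, 31))
print(len(dim_date), dim_date[0])
```

The rows can then be bulk-loaded into a DimDate table once and reused by every fact table that has a date key.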

Here's what I would do:

Declare the Grain of your fact table:

1 row per order line

Note how the grain doesn't contain anything that doesn't uniquely identify the row.

Dimensional Attributes of the order line:

Date
Time (if needed, and bucketed by hour/minute etc)
Product
Customer

Degenerate Dimensions of the order line (these are codes that are related to the transaction):

Order Number
Order Line Number

Some Sample Measures:

Item Price at time of Sale (optional, may be useful in some situations)
Discount Amount
Sale Dollars

This should answer all of those questions.

For the totals, a simple COUNT / SUM after filtering on the attributes of the dimensions should work fine.
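The design above could be sketched like this, with the grain enforced by a primary key on the two degenerate dimensions. The table name, column names, and sample order lines are hypothetical, and the optional Time key is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE FactOrderLine (
    DateKey         INTEGER NOT NULL,  -- dimension keys
    ProductKey      INTEGER NOT NULL,
    CustomerKey     INTEGER NOT NULL,
    OrderNumber     TEXT    NOT NULL,  -- degenerate dimensions
    OrderLineNumber INTEGER NOT NULL,
    ItemPrice       REAL,              -- measures
    DiscountAmount  REAL,
    SaleDollars     REAL,
    PRIMARY KEY (OrderNumber, OrderLineNumber)  -- grain: 1 row per order line
)
""")
lines = [
    (20240115, 1, 1, "A-100", 1, 50.0, 5.0, 45.0),
    (20240115, 2, 1, "A-100", 2, 10.0, 0.0, 10.0),
    (20240220, 1, 2, "A-101", 1, 50.0, 0.0, 50.0),
]
conn.executemany("INSERT INTO FactOrderLine VALUES (?, ?, ?, ?, ?, ?, ?, ?)", lines)

# A second row for the same order line would violate the declared grain.
try:
    conn.execute(
        "INSERT INTO FactOrderLine VALUES (20240115, 1, 1, 'A-100', 1, 50.0, 0.0, 50.0)"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Totals for one product: a plain COUNT / SUM over the filtered rows.
line_count, sale_dollars, discounts = conn.execute(
    """
    SELECT COUNT(*), SUM(SaleDollars), SUM(DiscountAmount)
    FROM FactOrderLine
    WHERE ProductKey = 1
    """
).fetchone()
print(duplicate_rejected, line_count, sale_dollars, discounts)
```

In practice the filter would usually go through a join to the product, customer, or date dimension rather than a literal key value.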

Keep in mind that the customer dimension is one of the most difficult to model; Kimball devotes a whole chapter of his book to it.

