
How should you separate dimension tables from fact tables if you are not building a data warehouse?

I realize that referring to these as dimension and fact tables is not exactly appropriate. I am at a loss for better terminology, so please excuse the categorization I use in this post.

I am building an application for employee record keeping.

The database will contain organizational information. The information is mostly defined in three tables: Locations, Divisions, and Departments. However, there are others with similar problems. First, I need to store the available values for these tables. This will allow for available values in the application when managing an employee and for management of these values when adding/deleting departments and such. For instance, the Locations table may look like,

LocationId | LocationName | LocationStatus
1 | New York | Active
2 | Denver | Inactive
3 | New Orleans | Active

I then need to store these values for each employee and keep their history. My first thought was to create LocationHistory, DivisionHistory, and DepartmentHistory tables. I cannot pinpoint why, but this struck me as poor design. My next inclination was to create a DimLocation/FactLocation, DimDivision/FactDivision, DimDepartment/FactDepartment set of tables. I do not believe this makes sense either. I have also considered naming them as a combination with Employee, i.e. EmployeeLocations, EmployeeDivisions, etc. Regardless of the naming convention for these tables, I imagine the data would look similar to the simplified version below:

EmployeeId | LocationId | EffectiveDate | EndDate
1 | 3 | 2008-07-01 | NULL
1 | 2 | 2007-04-01 | 2008-06-30

I realize any of the imagined solutions I described above could work, but I am really looking to create a design that will be easy for others to maintain with an intuitive, familiar structure. I would like to receive this community's help, opinions, and experience with this matter. I am open to and would welcome any suggestion to consider. For instance, should I even store the available values for these three tables in the database? Should they be maintained in the application code/business logic layer? Do I just need to get over seeing the word History repeating three times?

Thanks!

Firstly, I see no issue in describing these as Dimension and Fact tables outside of a warehouse :)

In terms of conceptualising and understanding the relationships, I personally find start/end dates perfectly easy for people to understand. You can have Agent and Location tables, and then time-dependent mapping tables such as Agent_At_Location, etc. They do, however, have issues worth noting.

  1. If EndDate is 2008-08-30, was the employee in that location UP TO 30th August, or UP TO and INCLUDING 30th August?

  2. Dealing with overlapping date periods can produce messy queries and, more importantly, slow queries.


The first one seems simply a matter of convention, but it can have certain implications when dealing with other data. For example, suppose an EndDate of 2008-08-30 means that they ARE at that location UP TO and INCLUDING 30th August. Then you join on to their Daily Agent Data for that day (such as when they actually arrived at work, left for breaks, etc). You need to join ON AgentDailyData.EventTimeStamp < '2008-08-30' + 1 in order to include all the events that happened during that day.

This is because the data's EventTimeStamp isn't measured in days, but probably minutes or seconds.

If you instead take an EndDate of '2008-08-30' to mean that the Agent was at that Location UP TO but NOT INCLUDING 30th August, the join does not need the + 1. In fact, you don't need to know whether the date is day-bound or can include a time component. You just need TimeStamp < EndDate.

By using EXCLUSIVE End markers, all of your queries simplify and never need + 1 day, or + 1 hour, to deal with edge conditions.
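As a rough illustration of the exclusive-end convention, here is a minimal sketch using Python's built-in sqlite3 module. The table layouts, column names other than EventTimeStamp and EndDate, and the sample rows are assumptions made up for the example, and dates are stored as ISO-8601 text so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables; EndDate is EXCLUSIVE, NULL means "still current".
    CREATE TABLE AgentAtLocation (
        AgentId INTEGER, LocationId INTEGER,
        EffectiveDate TEXT, EndDate TEXT
    );
    CREATE TABLE AgentDailyData (
        AgentId INTEGER, EventTimeStamp TEXT, EventName TEXT
    );

    INSERT INTO AgentAtLocation VALUES
        (1, 2, '2007-04-01', '2008-08-30'),
        (1, 3, '2008-08-30', NULL);
    INSERT INTO AgentDailyData VALUES
        (1, '2008-08-29 23:59:59', 'left work'),
        (1, '2008-08-30 00:00:00', 'arrived'),
        (1, '2008-08-30 17:30:00', 'left work');
""")

# No "+ 1 day" anywhere: each event falls in exactly one mapping row,
# whether the boundary values carry a time component or not.
rows = conn.execute("""
    SELECT d.EventTimeStamp, m.LocationId
    FROM AgentDailyData AS d
    INNER JOIN AgentAtLocation AS m
        ON  m.AgentId = d.AgentId
        AND d.EventTimeStamp >= m.EffectiveDate
        AND (m.EndDate IS NULL OR d.EventTimeStamp < m.EndDate)
    ORDER BY d.EventTimeStamp
""").fetchall()

for row in rows:
    print(row)
```

The last event on 29th August lands in Location 2 and both events on 30th August land in Location 3, with no row matched twice.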


The second one is much harder to resolve. The simplest way of resolving an overlapping period is as follows:

SELECT
  CASE WHEN TableA.InclusiveFrom > TableB.InclusiveFrom THEN TableA.InclusiveFrom ELSE TableB.InclusiveFrom END AS [NetInclusiveFrom],
  CASE WHEN TableA.ExclusiveFrom < TableB.ExclusiveFrom THEN TableA.ExclusiveFrom ELSE TableB.ExclusiveFrom END AS [NetExclusiveFrom]
FROM
  TableA
INNER JOIN
  TableB
    ON  TableA.InclusiveFrom < TableB.ExclusiveFrom
    AND TableA.ExclusiveFrom > TableB.InclusiveFrom

-- Where InclusiveFrom is the StartDate
-- And   ExclusiveFrom is the EndDate, up to but NOT including that date
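To see the overlap query run end to end, here is a small sketch against an in-memory SQLite database from Python. The Id columns and the sample periods are invented for the example; the column names InclusiveFrom and ExclusiveFrom are kept exactly as in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- InclusiveFrom = start date, ExclusiveFrom = end date (not included).
    CREATE TABLE TableA (Id INTEGER, InclusiveFrom TEXT, ExclusiveFrom TEXT);
    CREATE TABLE TableB (Id INTEGER, InclusiveFrom TEXT, ExclusiveFrom TEXT);

    INSERT INTO TableA VALUES
        (1, '2008-01-01', '2008-01-20'),
        (2, '2008-02-01', '2008-02-10');
    INSERT INTO TableB VALUES
        (1, '2008-01-10', '2008-02-01'),
        (2, '2008-03-01', '2008-03-05');
""")

overlaps = conn.execute("""
    SELECT
        TableA.Id AS AId,
        TableB.Id AS BId,
        CASE WHEN TableA.InclusiveFrom > TableB.InclusiveFrom
             THEN TableA.InclusiveFrom ELSE TableB.InclusiveFrom END AS NetInclusiveFrom,
        CASE WHEN TableA.ExclusiveFrom < TableB.ExclusiveFrom
             THEN TableA.ExclusiveFrom ELSE TableB.ExclusiveFrom END AS NetExclusiveFrom
    FROM TableA
    INNER JOIN TableB
        ON  TableA.InclusiveFrom < TableB.ExclusiveFrom
        AND TableA.ExclusiveFrom > TableB.InclusiveFrom
""").fetchall()

print(overlaps)
```

Only A1 and B1 overlap, from 10th to (but not including) 20th January. A2 starts on the very day B1 ends, and because the end is exclusive the two periods merely touch and are not reported.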

The problem with that query is one of indexing. The first condition, TableA.InclusiveFrom < TableB.ExclusiveFrom, could be resolved using an index, but it could match a massive range of dates. Then, for each of those records, the ExclusiveFrom values could be just about anything, and certainly not in an order that helps quickly resolve TableA.ExclusiveFrom > TableB.InclusiveFrom.

The solution I have previously used for that is to have a maximum allowed gap between InclusiveFrom and ExclusiveFrom. This allows something like...

    ON  TableA.InclusiveFrom <  TableB.ExclusiveFrom
    AND TableA.InclusiveFrom >= TableB.InclusiveFrom - 30
    AND TableA.ExclusiveFrom >  TableB.InclusiveFrom

The condition TableA.ExclusiveFrom > TableB.InclusiveFrom STILL can't benefit from indexes. But we've limited the number of rows that the search on TableA.InclusiveFrom can return. It's at most only ever 30 days of data, because we restricted the duration to a maximum of 30 days.
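A quick way to convince yourself that the extra bound does not change the result is to run both predicates side by side. The sketch below does that in SQLite from Python, with made-up rows in which every TableA period lasts at most 30 days; SQLite's date(x, '-30 days') stands in for the `- 30` date arithmetic, and the index name is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (Id INTEGER, InclusiveFrom TEXT, ExclusiveFrom TEXT);
    CREATE TABLE TableB (Id INTEGER, InclusiveFrom TEXT, ExclusiveFrom TEXT);
    CREATE INDEX IX_TableA_InclusiveFrom ON TableA (InclusiveFrom);

    -- Assumption enforced by the application: no TableA period exceeds 30 days.
    INSERT INTO TableA VALUES
        (1, '2008-01-01', '2008-01-31'),
        (2, '2008-01-31', '2008-02-20'),
        (3, '2008-03-01', '2008-03-15');
    INSERT INTO TableB VALUES (1, '2008-02-10', '2008-03-05');
""")

plain = conn.execute("""
    SELECT TableA.Id
    FROM TableA
    INNER JOIN TableB
        ON  TableA.InclusiveFrom < TableB.ExclusiveFrom
        AND TableA.ExclusiveFrom > TableB.InclusiveFrom
    ORDER BY TableA.Id
""").fetchall()

# Same join, plus a lower bound on InclusiveFrom so an index on that
# column only has to scan a 30-day-wide window per TableB row.
bounded = conn.execute("""
    SELECT TableA.Id
    FROM TableA
    INNER JOIN TableB
        ON  TableA.InclusiveFrom <  TableB.ExclusiveFrom
        AND TableA.InclusiveFrom >= date(TableB.InclusiveFrom, '-30 days')
        AND TableA.ExclusiveFrom >  TableB.InclusiveFrom
    ORDER BY TableA.Id
""").fetchall()

print(plain, bounded)
```

Both queries return the same rows here. The equivalence only holds while the maximum-duration rule is respected, so it is worth enforcing that rule with a constraint or in the code that writes the rows.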

An example of this is to break up the associations by calendar month (max duration of 31 days).

EmployeeId | LocationId | EffectiveDate | EndDate
    1      |     2      |  2007-04-01   | 2007-05-01
    1      |     2      |  2007-05-01   | 2007-06-01
    1      |     2      |  2007-06-01   | 2007-06-25

(Representing Employee 1 being in Location 2 from 1st April to (but not including) 25th June.)
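If the application writes these rows, the split by calendar month can be done before inserting. The helper below is a hypothetical sketch (the name split_by_month is invented here) that cuts a half-open range [start, end) at every month boundary.

```python
from datetime import date


def split_by_month(start: date, end: date) -> list[tuple[date, date]]:
    """Split the half-open range [start, end) so no piece crosses a month boundary."""
    pieces = []
    while start < end:
        # First day of the month following `start`.
        if start.month == 12:
            next_month = date(start.year + 1, 1, 1)
        else:
            next_month = date(start.year, start.month + 1, 1)
        piece_end = min(next_month, end)
        pieces.append((start, piece_end))
        start = piece_end
    return pieces


# Employee 1 in Location 2 from 1st April 2007 to (but not including) 25th June 2007.
segments = split_by_month(date(2007, 4, 1), date(2007, 6, 25))
for effective_date, end_date in segments:
    print(effective_date, end_date)
```

This yields three pieces (April, May, and 1st to 25th June), each at most 31 days long, which is what the bounded join condition relies on.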

It's effectively a trade off; using Disk Space to gain performance.

I've even seen this pushed to the extreme of not actually storing date Ranges, but storing the actual mapping for each and every day. Essentially, it's like restricting the maximum duration to 1 day...

EmployeeId | LocationId | EffectiveDate
    1      |     2      |  2007-06-23  
    1      |     2      |  2007-06-24  
    1      |     3      |  2007-06-25  
    1      |     3      |  2007-06-26  
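With one row per day, the range comparison turns into a plain equality join. Here is a minimal sketch in SQLite via Python; the table names EmployeeLocationDaily and EmployeeEvents, and the event rows, are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One mapping row per employee per day.
    CREATE TABLE EmployeeLocationDaily (
        EmployeeId INTEGER, LocationId INTEGER, EffectiveDate TEXT,
        PRIMARY KEY (EmployeeId, EffectiveDate)
    );
    CREATE TABLE EmployeeEvents (EmployeeId INTEGER, EventTimeStamp TEXT);

    INSERT INTO EmployeeLocationDaily VALUES
        (1, 2, '2007-06-23'),
        (1, 2, '2007-06-24'),
        (1, 3, '2007-06-25'),
        (1, 3, '2007-06-26');
    INSERT INTO EmployeeEvents VALUES
        (1, '2007-06-24 09:00:00'),
        (1, '2007-06-25 08:55:00');
""")

# The join is a key lookup on (EmployeeId, day); no overlap logic is needed.
rows = conn.execute("""
    SELECT e.EventTimeStamp, m.LocationId
    FROM EmployeeEvents AS e
    INNER JOIN EmployeeLocationDaily AS m
        ON  m.EmployeeId = e.EmployeeId
        AND m.EffectiveDate = date(e.EventTimeStamp)
    ORDER BY e.EventTimeStamp
""").fetchall()

print(rows)
```

The primary key on (EmployeeId, EffectiveDate) also guarantees that an employee cannot be mapped to two locations on the same day, something the date-range design has to check separately.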

Instinctively, I initially rebelled against this. But in subsequent ETL, warehousing, reporting, etc., I found it very powerful, adaptable, and maintainable. I saw people making fewer coding mistakes and writing code in less time, and the code ended up running faster and adapting more easily to clients' changing needs.

The only two downsides were:
1. More disk space taken (but trivial compared to the size of fact tables)
2. Inserts and updates to this mapping were slower

The slowdown for inserts and updates only actually mattered once, where this model was being used to represent a constantly changing process net and the app wanted to change the mapping about 30 times a second. Even then it worked; it just chomped up more CPU time than was ideal.

If you want to be efficient and keep a history, do these things. There are multiple solutions to this problem, but this is the one that I keep going back to:

  1. Remember that each row represents a single entity. If you make corrections to that entity, that's fine, but don't re-use an ID for a new Location. Set it up so that instead of deleting a Location, you mark it as deleted with a bit and hide it from the interface. That way, when it's referenced historically, it's still there.

  2. Create a history table that includes the current value, or no records if a value isn't currently set. Have the foreign key tie back to the employee and tie to the location.

  3. Create a column in the employee table that points to the current active location in the history. When you need to get the employee's location, you join in the history table based on this ID. When you need to get all of the history for an employee, you join from the history table.

  4. This structure keeps it all normalized, and gives you an easy way to find the current value without having to do any date comparisons.

  5. As far as using the word history, think of it in different terms: since it contains the current item as well as historical items, it's really just a junction table that keeps around the old item. As such you can name it something like EmployeeLocations.
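The points above can be sketched roughly as follows, again using SQLite from Python. The column names IsDeleted, CurrentEmployeeLocationId, EmployeeLocationId, and EmployeeName are assumptions chosen for the example, as are the sample rows; only the overall shape comes from the list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Locations (
        LocationId   INTEGER PRIMARY KEY,
        LocationName TEXT NOT NULL,
        IsDeleted    INTEGER NOT NULL DEFAULT 0   -- soft delete; IDs are never re-used
    );
    CREATE TABLE Employees (
        EmployeeId   INTEGER PRIMARY KEY,
        EmployeeName TEXT NOT NULL,
        CurrentEmployeeLocationId INTEGER         -- points at the active history row
    );
    CREATE TABLE EmployeeLocations (              -- current row plus all past rows
        EmployeeLocationId INTEGER PRIMARY KEY,
        EmployeeId    INTEGER NOT NULL REFERENCES Employees (EmployeeId),
        LocationId    INTEGER NOT NULL REFERENCES Locations (LocationId),
        EffectiveDate TEXT NOT NULL
    );

    INSERT INTO Locations VALUES (1, 'New York', 0), (2, 'Denver', 1), (3, 'New Orleans', 0);
    INSERT INTO Employees VALUES (1, 'Example Employee', NULL);
    INSERT INTO EmployeeLocations VALUES (10, 1, 2, '2007-04-01'), (11, 1, 3, '2008-07-01');
    UPDATE Employees SET CurrentEmployeeLocationId = 11 WHERE EmployeeId = 1;
""")

# Current location: follow the pointer, no date comparison required.
current = conn.execute("""
    SELECT l.LocationName
    FROM Employees AS e
    INNER JOIN EmployeeLocations AS el
        ON el.EmployeeLocationId = e.CurrentEmployeeLocationId
    INNER JOIN Locations AS l ON l.LocationId = el.LocationId
    WHERE e.EmployeeId = 1
""").fetchone()

# Full history: start from the history table; soft-deleted locations still resolve.
history = conn.execute("""
    SELECT l.LocationName, el.EffectiveDate
    FROM EmployeeLocations AS el
    INNER JOIN Locations AS l ON l.LocationId = el.LocationId
    WHERE el.EmployeeId = 1
    ORDER BY el.EffectiveDate
""").fetchall()

print(current, history)
```

Denver is flagged as deleted, so the interface can hide it from pick lists, yet it still shows up in the employee's history because the row was never removed.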


 