Slowly changing dimension tables can be challenging to validate as the effective dates span records. The expectation is that for a given entity we have one or more records with a range of effective dates without overlaps or gaps in coverage. So lets write some test cases!
We will start with a simple employee dimension table that tracks employee name and salary changes over time. We’ll introduce some errors and will use the employee name to identify the test case. No employees were harmed in this testing.
There are variations on how to model a type two slowly changing dimension. Our rules are we have a clearly defined beginning of time (1900-01-01) and end of time (9999-12-31). Our effective dates do not overlap, that is the end_date of one record is succeeded by a start_date that is one day later. Some people model with overlaps – the technique here works in both scenarios with a minor code edit. Some models don’t include a beginning of time record. That’s another topic but we will include it here.
All code is mySQL compliant, you may need slight adjustments depending upon your platform.
Create Test Data
drop table if exists employee_dim;
create table employee_dim
(
emp_id integer not null,
emp_name varchar(100) not null,
salary decimal(10,2) not null,
start_date date not null,
end_date date not null
);
-- Insert some test data, use emp_name to indicate the scenario being tested
insert into employee_dim values (1, "well formed", 1000, "1900-01-01", "2020-01-01");
insert into employee_dim values (1, "well formed", 1100, "2020-01-02", "2021-01-01");
insert into employee_dim values (1, "well formed", 1200, "2021-01-02", "9999-12-31");
insert into employee_dim values (2, "overlapping dates", 1000, "1900-01-01", "2020-01-01");
insert into employee_dim values (2, "overlapping dates", 1100, "2020-01-01", "2021-01-01");
insert into employee_dim values (2, "overlapping dates", 1200, "2020-12-31", "9999-12-31");
insert into employee_dim values (3, "date gaps", 1000, "1900-01-01", "2020-01-01");
insert into employee_dim values (3, "date gaps", 1100, "2020-01-03", "2021-01-01");
insert into employee_dim values (3, "date gaps", 1200, "2021-01-02", "9999-12-31");
insert into employee_dim values (4, "encapsulated dates", 1000, "1900-01-01", "2022-01-01");
insert into employee_dim values (4, "encapsulated dates", 1100, "2020-01-02", "2021-01-01");
insert into employee_dim values (4, "encapsulated dates", 1200, "2022-01-02", "9999-12-31");
insert into employee_dim values (5, "duplicate dates", 1000, "1900-01-01", "9999-12-31");
insert into employee_dim values (5, "duplicate dates", 1100, "1900-01-01", "9999-12-31");
insert into employee_dim values (6, "well formed with matching dates", 1000, "1900-01-01", "2022-01-01");
insert into employee_dim values (6, "well formed with matching dates", 1100, "2022-01-02", "2022-01-02");
insert into employee_dim values (6, "well formed with matching dates", 1200, "2022-01-03", "9999-12-31");
Let’s examine well formed data for one employee. The date ranges do not overlap and have no gaps. The dates ranges begin and end with our defined constants.
emp_id | emp_name | salary | start_date | end_date |
1 | well formed | 1000.00 | 1900-01-01 | 2020-01-01 |
1 | well formed | 1100.00 | 2020-01-02 | 2021-01-01 |
1 | well formed | 1200.00 | 2021-01-02 | 9999-12-31 |
Date Range Testing
We are going to compute two columns with windowing functions. The first is the row number of the window. This is important as there’s no preceding dates for the first record – this is an edge case to consider. We partition by the employee id and order by both dates. Ordering by both dates is important as we may have records where the start_date and end_date are the same. One record like this is acceptable, more than one is an error case (present in our test data).
(row_number() over(partition by emp_id order by start_date, end_date)) as window_row_number
The second column is the delta between the current record’s start_date and the preceding record’s end_date. This clause is likely where code adjustments may be needed depending upon your SQL implementation.
datediff(start_date, lag(end_date) over(partition by emp_id order by start_date, end_date)) AS date_delta
Before building the test cases let’s look at what we get for the well formed employee. We have a delta of 1 for all records except for the first.
emp_id | emp_name | start_date | end_date | window_row_number | date_delta |
1 | well formed | 1900-01-01 | 2020-01-01 | 1 | NULL |
1 | well formed | 2020-01-02 | 2021-01-01 | 2 | 1 |
1 | well formed | 2021-01-02 | 9999-12-31 | 3 | 1 |
Next we wrap these into a sub-query and test the results. The sub-query is required as you can’t use the windowing functions in a where / having clause. Our first test is that the number of days between records is always one.
Test One – Number of days between records
select
*
from
(
select
emp_id,
emp_name,
start_date,
end_date,
(row_number() over(partition by emp_id order by start_date, end_date)) as window_row_number,
datediff(start_date, lag(end_date) over(partition by emp_id order by start_date, end_date)) AS date_delta
from
employee_dim
) foo
where
window_row_number > 1 and
date_delta != 1;
And our results:
emp_id | emp_name | start_date | end_date | window_row_number | date_delta |
2 | overlapping dates | 2020-01-01 | 2021-01-01 | 2 | 0 |
2 | overlapping dates | 2020-12-31 | 9999-12-31 | 3 | -1 |
3 | date gaps | 2020-01-03 | 2021-01-01 | 2 | 2 |
4 | encapsulated dates | 2020-01-02 | 2021-01-01 | 2 | -730 |
4 | encapsulated dates | 2022-01-02 | 9999-12-31 | 3 | 366 |
5 | duplicate dates | 1900-01-01 | 9999-12-31 | 2 | -2958463 |
This query returns each of our test error cases. From here you can take the emp_id values and apply fixes. Note that the fix may be adjusting the preceding records which are not present in the test case.
NOTE: If your implementation of effective dates has end_date = start_date on the subsequent record then the date_delta test should be ‘!=0’ instead of ‘!=1’.
Test Two – Beginning of Time
Assuming we have a defined beginning of time, we can test for it.
select
*
from
(
select
emp_id,
emp_name,
start_date,
end_date,
(row_number() over(partition by emp_id order by start_date, end_date)) as window_row_number
from
employee_dim
) foo
where
window_row_number = 1 and
start_date != "1900-01-01"
You can also do this as a simple select min(start_date) grouped by emp_id test.
Test Three – End of Time
Here we’ll flpp the ordering to descending to test the end of time date.
select
*
from
(
select
emp_id,
emp_name,
start_date,
end_date,
(row_number() over(partition by emp_id order by start_date desc, end_date desc)) as window_row_number
from
employee_dim
) foo
where
window_row_number = 1 and
start_date != "999-12-31"
Conclusion
This isn’t the complete list of tests, we would also want to test for nulls in key fields if our underlying storage layer doesn’t enforce constraints. You may want to test that a change was actually captured. You may want to set a threshold for the number of rows per employee. A high record count can be indicative of an attribute that is behaving more like a fact. Finally you may want to alter the output / wrap these tests in your pipeline with appropriate alerting.