Beyond Basic SQL – Time of Event Date Validation

B

Slowly changing dimension tables can be challenging to validate as the effective dates span records. The expectation is that for a given entity we have one or more records with a range of effective dates without overlaps or gaps in coverage. So lets write some test cases!

We will start with a simple employee dimension table that tracks employee name and salary changes over time. We’ll introduce some errors and will use the employee name to identify the test case. No employees were harmed in this testing.

There are variations on how to model a type two slowly changing dimension. Our rules are we have a clearly defined beginning of time (1900-01-01) and end of time (9999-12-31). Our effective dates do not overlap, that is the end_date of one record is succeeded by a start_date that is one day later. Some people model with overlaps – the technique here works in both scenarios with a minor code edit. Some models don’t include a beginning of time record. That’s another topic but we will include it here.

All code is mySQL compliant, you may need slight adjustments depending upon your platform.

Create Test Data

drop table if exists employee_dim;

create table employee_dim
(
  emp_id integer not null,
  emp_name varchar(100) not null,
  salary decimal(10,2) not null,
  start_date date not null,
  end_date date not null
);

-- Insert some test data, use emp_name to indicate the scenario being tested
insert into employee_dim values (1, "well formed", 1000, "1900-01-01", "2020-01-01");
insert into employee_dim values (1, "well formed", 1100, "2020-01-02", "2021-01-01");
insert into employee_dim values (1, "well formed", 1200, "2021-01-02", "9999-12-31");

insert into employee_dim values (2, "overlapping dates", 1000, "1900-01-01", "2020-01-01");
insert into employee_dim values (2, "overlapping dates", 1100, "2020-01-01", "2021-01-01");
insert into employee_dim values (2, "overlapping dates", 1200, "2020-12-31", "9999-12-31");

insert into employee_dim values (3, "date gaps", 1000, "1900-01-01", "2020-01-01");
insert into employee_dim values (3, "date gaps", 1100, "2020-01-03", "2021-01-01");
insert into employee_dim values (3, "date gaps", 1200, "2021-01-02", "9999-12-31");

insert into employee_dim values (4, "encapsulated dates", 1000, "1900-01-01", "2022-01-01");
insert into employee_dim values (4, "encapsulated dates", 1100, "2020-01-02", "2021-01-01");
insert into employee_dim values (4, "encapsulated dates", 1200, "2022-01-02", "9999-12-31");

insert into employee_dim values (5, "duplicate dates", 1000, "1900-01-01", "9999-12-31");
insert into employee_dim values (5, "duplicate dates", 1100, "1900-01-01", "9999-12-31");

insert into employee_dim values (6, "well formed with matching dates", 1000, "1900-01-01", "2022-01-01");
insert into employee_dim values (6, "well formed with matching dates", 1100, "2022-01-02", "2022-01-02");
insert into employee_dim values (6, "well formed with matching dates", 1200, "2022-01-03", "9999-12-31");

Let’s examine well formed data for one employee. The date ranges do not overlap and have no gaps. The dates ranges begin and end with our defined constants.

emp_idemp_namesalarystart_dateend_date
1well formed1000.001900-01-012020-01-01
1well formed1100.002020-01-022021-01-01
1well formed1200.002021-01-029999-12-31

Date Range Testing

We are going to compute two columns with windowing functions. The first is the row number of the window. This is important as there’s no preceding dates for the first record – this is an edge case to consider. We partition by the employee id and order by both dates. Ordering by both dates is important as we may have records where the start_date and end_date are the same. One record like this is acceptable, more than one is an error case (present in our test data).

(row_number() over(partition by emp_id order by start_date, end_date)) as window_row_number

The second column is the delta between the current record’s start_date and the preceding record’s end_date. This clause is likely where code adjustments may be needed depending upon your SQL implementation.

datediff(start_date, lag(end_date) over(partition by emp_id order by start_date, end_date)) AS date_delta

Before building the test cases let’s look at what we get for the well formed employee. We have a delta of 1 for all records except for the first.

emp_idemp_namestart_dateend_datewindow_row_numberdate_delta
1well formed1900-01-012020-01-011NULL
1well formed2020-01-022021-01-0121
1well formed2021-01-029999-12-3131

Next we wrap these into a sub-query and test the results. The sub-query is required as you can’t use the windowing functions in a where / having clause. Our first test is that the number of days between records is always one.

Test One – Number of days between records

select
  *
from
(
  select
    emp_id,
    emp_name,
    start_date,
    end_date,
    (row_number() over(partition by emp_id order by start_date, end_date)) as window_row_number,
    datediff(start_date, lag(end_date) over(partition by emp_id order by start_date, end_date)) AS date_delta
  from
     employee_dim
) foo
where
  window_row_number > 1 and
  date_delta != 1;

And our results:

emp_idemp_namestart_dateend_datewindow_row_numberdate_delta
2overlapping dates2020-01-012021-01-0120
2overlapping dates2020-12-319999-12-313-1
3date gaps2020-01-032021-01-0122
4encapsulated dates2020-01-022021-01-012-730
4encapsulated dates2022-01-029999-12-313366
5duplicate dates1900-01-019999-12-312-2958463

This query returns each of our test error cases. From here you can take the emp_id values and apply fixes. Note that the fix may be adjusting the preceding records which are not present in the test case.

NOTE: If your implementation of effective dates has end_date = start_date on the subsequent record then the date_delta test should be ‘!=0’ instead of ‘!=1’.

Test Two – Beginning of Time

Assuming we have a defined beginning of time, we can test for it.

select
  *
from
(
  select
    emp_id,
    emp_name,
    start_date,
    end_date,
    (row_number() over(partition by emp_id order by start_date, end_date)) as window_row_number
  from
    employee_dim
) foo
where
  window_row_number = 1 and
  start_date != "1900-01-01"

You can also do this as a simple select min(start_date) grouped by emp_id test.

Test Three – End of Time

Here we’ll flpp the ordering to descending to test the end of time date.

select
  *
from
(
  select
    emp_id,
    emp_name,
    start_date,
    end_date,
    (row_number() over(partition by emp_id order by start_date desc, end_date desc)) as window_row_number
  from
    employee_dim
) foo
where
  window_row_number = 1 and
  start_date != "999-12-31"

Conclusion

This isn’t the complete list of tests, we would also want to test for nulls in key fields if our underlying storage layer doesn’t enforce constraints. You may want to test that a change was actually captured. You may want to set a threshold for the number of rows per employee. A high record count can be indicative of an attribute that is behaving more like a fact. Finally you may want to alter the output / wrap these tests in your pipeline with appropriate alerting.

By Glen Pennington

Glen Pennington

I have over 30 years of IT experience in the application of computer technology to many fields of business. I have deep experience in developing ETL/ELT pipelines to populate data warehouses, data lakes, and data products built on traditional relational database and cloud based platforms such as Snowflake and the Spark ecosystem.

As a data architect I can design and implement data products with consistent naming standards and rigorous data quality standards. These solutions are built on insights gained through data profiling and enforced through embedded data quality checks.

I have extensive experience using many languages and platforms and I have experience in several software development methodologies.

Where to find me on the socials