You have a query, maybe one you didn’t write. How do you test and validate that the joins are correct? Let’s walk through a simple example. I’m running MySQL with the employees sample database installed (https://dev.mysql.com/doc/employee/en/). Below is a query to find the current department for each employee. How do we know if it is correct?
select
e.emp_no,
e.first_name,
e.last_name,
de.dept_no
from
employees e
join dept_emp de on e.emp_no = de.emp_no
where
curdate() between de.from_date and de.to_date;
For this article we are not going to use a data profiling tool. Many of these steps can be replaced by a competent data profiling tool (ydata profiling is a good choice).
Step One
First we modify the query to return a record count.
select
count(1)
from
employees e
join dept_emp de on e.emp_no = de.emp_no
where
curdate() between de.from_date and de.to_date;
This query returns 240,124 records. Is this what we expect?
Step Two
Next we looking at the record counts of each table.
select
count(1)
from
employees;
select
count(1)
from
dept_emp;
The employees table has 300,024 records, dept_emp has 331,603 records. There are 300k employee records, we now know that it did not find a department record for each employee. Another thing to look for at this stage is an inflated record count that indicates a possible many to many join.
Step Three
What next? Let’s test the columns that make up the join expression. Starting with employees, we check for nulls and whether emp_no is unique for each record.
select
count(1)
from
employees
where
emp_no is null;
select
emp_no,
count(1)
from
employees
group by
emp_no
having
count(1) > 1;
Both queries return zero records – therefore emp_no is a primary/unique key. Alternately, we could also explore the system catalog / DDL for the table to see that there is a primary index on the column. Insights can usually be gleaned from relational database system catalogs, use them if they are available. If your keys are singular columns then your data profiling tool should answer these questions. Compound primary keys usually require manual tests.
Testing dept_emp is a bit more complicated as it’s a child table, but the approach is similar. We expect one department for each employee on a given date.
select
count(1)
from
dept_emp
where
emp_no is null;
select
dept_no,
emp_no,
count(1)
from
dept_emp
where
curdate() between from_date and to_date
group by
dept_no,
emp_no
having
count(1) > 1;
Both queries return zero records – this tells us the emp_no + dept_no is unique for a given date. So far so good, it is a likely a good key candidate but we still don’t know why we are not finding a department for all employees.
What about from_date and to_date?
select
count(1)
from
dept_emp
where
from_date is null;
select
count(1)
from
dept_emp
where
to_date is null;
Both queries return 0 records. If either contained nulls then we would have to consider the implications of null comparisons in our where clause. At this point I would move on to the checks below while keeping in mind that we haven’t looked at the distribution of dates in these fields.
Step Four
At this point we are getting into context specific testing. Steps 1-3 can be executed in almost all cases.
So we can check if all employees have a department at some point in time. There are potentially many dept_emp records for a given emp_no, so we rewrite the query as a (not) in test.
select
count(1)
from
employees
where
emp_no not in (select distinct emp_no from dept_emp);
This too returns 0 records. So every employee has a department assignment at some point in time, but not at the current date. Checking the employees table structure, we see that we have a hire date but no termination date. A potential explanation is that we have employees that are no longer at the company. We can test this somewhat, but this is ultimately a business logic question.
select
e.emp_no,
max(de.to_date) as max_to_date
from
employees e
join dept_emp de on e.emp_no = de.emp_no
group by e.emp_no;
This returns 300,024 records as expected. What stands out is that most, but not all, employees have a maximum to_date of ‘9999-01-01’. Let’s dig into that a bit.
select * from
(
select
e.emp_no,
max(de.to_date) as max_to_date
from
employees e
join dept_emp de on e.emp_no = de.emp_no
group by
e.emp_no
) foo
where
max_to_date != '9999-01-01';
Almost 60k records which is what we’d expect (300k employee records, 240k with a current department). Let’s confirm that the max_to_date is in the past which would be evidence of the terminated employee theory.
select max(max_to_date) from
(
select
e.emp_no,
max(de.to_date) as max_to_date
from
employees e
join dept_emp de on e.emp_no = de.emp_no
group by
e.emp_no
) foo
where
max_to_date != '9999-01-01';
This returns 2002-08-01. A date well in the past (old test dataset!) this gives us enough information to return to the subject matter experts to question how terminated employees are represented in the data.
Conclusion
Is the query correct? Yes, it answers the question of the current department of each employee. Confirming the question of how terminated employees are represented in the data may lead to uncovering a data quality issue.