Introduction
You know your way around the select statement, you can join tables in various ways, use aggregate functions, can alter existing tables and create new objects, etc. But sometimes your SQL just doesn’t work, or it somewhat works giving odd results. So what’s the next step to take in improving your SQL skills?
I’m decent with SQL, good enough at least that I’m often called on to debug other people’s code. There’s one problem area that comes up more than all others, one that manifests itself in many different ways. But at it’s core the problem is the same. What is it? Simply, the coder does not understand nulls and how they impact SQL processing.
So what is null and how should we think about it? Here are some definitions, pick one or more that resonate:
- null is the representation of “missing information and inapplicable information” – E.F. Codd
- null means a value is not known or doesn’t exist
- null is not a value, null is the absence of value
- null is a state, not a value
Zero is a value, an empty string ” is a value. If we don’t know the answer then we use null to represent this. How many angels can dance on the head of a pin? null angels
You probably know this already, but this has many implicatations and some of them are subtle. For example, assume you have (a == b) in a case statement or a where clause. You’ve worked with several programming languages, you see what appears to be a boolean expression and can intuit what happens when the expression is either true or false. However, this isn’t a boolean expression, it’s a ternary expression. If you think it’s a boolean expression then you are missing 1/3 of the possibilities as the possible outcomes are true, false, and null.
Truth Tables
Assume we have two variables, a and b. Each variable may either be True, False, or null. What do we get when we combine them in different ways?
a | b | a OR b | a AND b | a = b |
---|---|---|---|---|
True | True | True | True | True |
True | False | True | False | False |
True | null | True | null | null |
False | True | True | False | False |
False | False | False | False | True |
False | null | null | False | null |
null | True | True | null | null |
null | False | null | False | null |
null | null | null | null | null |
Notice that the presence of a null in either column can cause the result to be null. For equality tests, the presence of a null in either column results in a null. For OR you can get True results if one or both columns are True. But a False / null combination results in null, you can’t definitively say that the expression is either True or False. Likewise with AND a True/null results in null. The last line is especially interesting, a null/null equality check does not return True.
It’s important to understand the truth tables for AND, OR, and EQUALITY.
Examples
(2 = null) -> result is null.
A known value does not equal null.
(null and true) -> result is null.
An unknown value isn’t definitely true.
(null or true) -> result is true.
The null is meaningless in this expression.
(null = null) -> result is null.
The unknown doesn’t necessarily equal the unknown.
(null != null) -> result is null
The unknown doesn’t necessarily not equal the unknown.
Finally, 99.99% of the time you are making a mistake if you type these
some_val = null
some_val != null
OK I get it, but do you?
select
case i
when null then 'null'
when 0 then 'zero'
else 'non-zero'
end as foo
from bar;
This will never return ‘null’, because i = null is always false. There are many instances where an equality test is being performed without an equality sign.
-- Correct syntax
select
case i
when i is null then 'null'
when 0 then 'zero'
else 'non-zero'
end as foo
from bar;
How about joins?
select
*
from
foo
where
bar = null;
The result set is zero rows, even where the bar column contains nulls.
select
*
from
foo
where
(bar = 1) or
(bar != 1);
versus
select
*
from
foo;
These are not the same, the first query will not return records where bar is null.
select
*
from
foo
left join bar on foo.key = bar.key
where
bar.another_var = 'value';
This is an inner join! If bar doesn’t have a matching record then all columns in bar are null. Therefore, they fail the equality test. This is a very common mistake!
select
*
from
foo
where
some_val in (select some_val from bar);
Notes: The rows where some_val is null in either table will never be matched even if both tables have rows where some_val is null.
null in (null)
Notes: this is an unknown
What about math?
null * 2 = null
null / 0 = ??
NOTES: could be null, could give division by zero error (platform dependent. That’s right, you can’t count on null behaviour to be consistent across database platforms. Yay!)
What about counting?
These return the same count…
count()
count(1)
count('foo')
…but this is not when some_col contains null values.
count(some_col)
How about sums and averages?
sum(i) / count(*)
…is not the same as…
avg(i) when i contains nulls
Notes: avg(), sum() excludes nulls when calculating. This may not be what you want! Also, you can’t trust that all databases handle this case the same.
Grouping and sorting
(null = null) -> null
But for purposes of grouping and sorting, all nulls are equal!
For sorting, nulls will typically come first. Some SQL interpreters allow you to specify where nulls will appear.
order by some_val nulls last;
Dealing with nulls
Often the best way to handle nulls is to replace them with a known value or filter them out. The null functions that are available to you will vary by database platform. Most have some variation on the following:
The correct way to test for the presence of nulls, use in where clauses and case statements.
case
some_col is null
some_col is not null
Useful functions
Null functions vary between database implementations, but you can almost always find some mix of the following.
coalesce(col1, col2, col3)
This returns the first variable encountered that contains a value
nvl(expression, value)
This returns value if expression is null. Very handy!
concetenate(col1, col2, col3)
concat(col1, col2, col3)
These return a concatenated string that ignores null values instead of returning null for the entire expression.
‘string’ || null will result in a null on many, but not all database platforms (sideeye Oracle).
It could be worse!
- E.F. Codd suggested two values for null, “missing but applicable” and “missing but inapplicable”
- “high null” and “low null” have also been suggested
- Either would result in 4 value logic when most coders think it’s 2 value logic
- Nulls are confusing in SQL because they break fundamental assumptions built into relational theory (closed universe)
Summary
The horse is dead and has been flogged appropriately. My recommendation is to know beforehand which columns have null and to plan accordingly.