It is really common to have columns that stores sequences of numbers in database systems (specially in relational databases). Users and analysts expect these fields to store the values sequentially without gaps. That is, there would have no value missing between the initial one and the maximum value stored at the column. However, various situations may lead to gaps and missing values in such fields, including runtimes and application design problems. Gaps can exist even when using database Sequences (see our post on Sequences).
In this post, we present three ways to identify missing values and gaps in numeric database fields.
I – Using an auxiliary table that have no gaps – Or the alternative you would probably not use at first!
Let´s first suppose you have a table that numeric sequences with no gaps or missing values. This table would help you to find gaps in other ones. That would be excellent! You just have to use an Outer Join or a Not Exists query (see this post for how to use Outer Join instead of Not Exists or see this post on how to write Not Exists queries from Not In commands).
Suppose AllValuesColumn is the name of tha collumn that has no gaps and AllValuesTable is the table of such column. If you want to find the gaps in the Id column of the Products table, you would just have to execute one of the following commands:
What is the problem with such strategy? You probably DO NOT have such table with no gaps a priori. It is exactly what you are looking for: to find gaps in existing tables.
One may think: it is just to create a table and load the rows with all the required values in it. It may seem a straighfoward alternative, but it would take some work to create and load the table. So, what if you can build such table on the fly? Let´s see how to do that in the next alternative.
II – Using a recursive query – may have low performance and DBMS specific syntax
If the problem is you don’t have a list of all the possible values for a column (e.g. like the above AllValuesColumn), then you can build such list on the fly using recursive queries.
Many DBMS can execute recursive SQL commands, but the syntax os such type of commands usually differ betwenn of management systems.
In the following command, we use an Oracle’s syntax for recursive query. In such command, the recursive query creates a virtual list of all the numbers we need to look for in the Customer_ID column of the Curstomers table. Such virtual list (i.e. the List subquery) contains all the values between 1 and the highest Customer_ID. Then, an Not Exists clause is used to look for the values that are missing in the Customers table. In such example, values 1, 3, 8 and 11 are missing in the Customer_ID column.
The use of the recursive query can be somewhat confusing. You can find a detailed description on such example in our post on Looking for missing values using recursive queries. Another example of recursive query usage is available in this post, which uses recursive a querie to generate a list of dates.
Some of the problems of such strategy are that recursive queries may have performance problems, are somewhat more complex than regular ones and have DBMS specific syntax. So, using a recursive query is the best option? Probably, not.
Lett’s take a look at another alternative.
Using Lead or Lag functions: the alternative that runs over several DBMS
LEAD and LAG are window functions that enable the simultaneous access to a table’s row and to other row related (that follows or precedes) to it without the use of a self join (i.e. join the table to itself). LAG lets you access the value from in a row that lags (precedes) the current row while provides access to a row that follows the current row. Several DBMS provide support to such functions, including Oracle, Microsoft SQL Server, PostgreSQL and MySQL. So this alternative should be fine for you, if you are looking for to build portable SQL commands.
In the following, we present an example on the use of LAG to access two rows of the product table that precede the current one. You can find a more detailed description of such function in this post.
Both LEAD and LAG can be used to look for missing values. You just have to think this way: if a column stores sequential numbers, and we access rows’ values in a ordered way, when the difference between the values of sequential rows is greater than 1, a value is missing. This is just what we look for in the next command: id – prev_id > 1.
You can also use such functions to build formated results with intervals of missing values, instead of listing all values, just like in the following query.
A detailed explaination on the use of LEAD and LAG to find gaps and missing values, together with another examples and uses, can be found here. Besides being more portable, LEAD and LAG would have better performance than recursive queries in many situations.
Now, that you have seen the pros and cons of each alternative, you may choose the one that is best for you!