3 ways to remove duplicate data

Duplicate data can be generated in the database due to design error, software bugs or misuse by end users. Depending on the cause of the duplicate data, different commands should be used. In this post, we’ll show you three different ways to remove duplicate data.

Delete duplicates using Row_Number

Suppose you have a Clients table (Id, Name, Address and Zip the table’s columns) that doesn’t have a primary key. This is a design error you want to correct. But when you try to create the key, you identify that you have duplicate data: several groups of two or more rows where all the columns have the same value in distinct rows of each group. Then, you have to remove duplicate data in order to create the primary key.

To remove such duplicate data, you may use the Row_Number function to create a way to identify distinct rows that have the same data. Row_Number is an analytic function defined in standard SQL that is present in several current DBMS, including Oracle, MS SQL Server, MySQL, PostgreSQL and SQLite. But to delete duplicate data, it can be used togheter with other operators and functions that are exclusive from the DBMS you use.

For instance, in the following command, we use Row_Number togheter with Oracle’s Rowid pseudocolumn to remove duplicate data (see our post on Rowid for more information on such column) :

If you use MS SQL Server DBMS, then you may use Row_Number and the WITH clause, just like in the next command:

Using a table to store distinct data

Suppose you have the above described Clients table without primary key and duplicate data. Another way to solve this situation is to use a table to store distinct client data and then reload the data without duplicates into the original table. In the following, we create a temporary table to store distinct rows of Clients table (we use Oracle’s syntax to create the table, but similar behaviour can achieved in other DBMS with small changes).

After creating the Distinct_clients table, you can remove all the data in Clients table and, then, insert distinct rows.

Now, you can add the primary key to you Clients table and remove Distinct_clients from the database.

Using MIN/MAX functions

Now, suppose you have a primary key (a surrogate key) in the Clients table, but you found the user registered the same client several times in the database. That means there are several rows with the same data for all columns except for the primary key. To solve this situation, you may remove duplicate data using standard MIN/MAX functions.

In the following command, we use MIN. If you want to use MAX, you just have to change the last line in the subquery.

Choose the best command for your database and privileges!

Leave a Reply

Your email address will not be published. Required fields are marked *