Removing duplicates in sql
In modern web development, it’s standard practice to make use of a database abstraction layer, typically an Object-Relational Mapper based on either the Active Record pattern or the Data Mapper pattern. There are several pros and cons to this which are fairly well established, so I’ll spare us all from enumerating them all right now.
One established pro worth mentioning is that these systems typically provide a high level interface for fetching data, often removing the need to ‘hand write’ SQL for anything but the most complex of query.
As good as this sounds it has a few drawbacks, but one that has come up several times recently and hence warrants a mention is that it’s becoming easier to overlook defining schemata that accurately model the data they are intended to store.
Specifically, the definition of correct indexes is something that can be easy to forget but have some pretty nasty side effects e.g. failing to use unique indexes in situations that require them, resulting in duplicate data in the database and a bad time for everyone.
For instance, say you have a user flagging system, a list of boolean values associated with a user, represented by the following table:
CREATE TABLE `userflag` ( `userflagid` mediumint(8) unsigned NOT NULL, `type` varchar(50) NOT NULL, `userid` int(11) NOT NULL, `timecreated` int(11) NOT NULL, PRIMARY KEY (`userflagid`), INDEX `flagidentity` (`userid`,`type`) );
Notice that the flagidentity index is not unique, meaning that the database will allow multiple rows for a user - flag tuple, which is incorrect and will be a problem.
Obviously, attempting to drop and recreate the index after non-unique values have been inserted will fail, meaning we need some way of detecting and removing the duplicates.
First step, detecting duplicates:
SELECT userid, type, count(*) AS total FROM userflag GROUP BY userid, type HAVING total > 1
This is a fairly trivial query, the idea being that if we group by the fields that should be unique, and make use of the HAVING clause, we can return a list of userid - type pairs that violate our desired uniqueness. Now to turn this into something that we can use to remove the duplicates:
DELETE f.* FROM userflag AS f JOIN ( SELECT userid, type, count(*) AS total, min(userflagid) AS userflagid FROM userflag GROUP BY userid, type HAVING total > 1 ) AS d ON f.userid = d.userid AND f.type = d.type AND f.userflagid > d.userflagid;
Here, the previous query has been used to build a delete query that has one interesting property - it will delete the duplicates but preserve the first inserted for each combination of userid & type.
After changing one line of code, everything doesn't feel good.