Thursday, February 26, 2009

MySQL Delete Duplicate Records with Alter Table

So your application has been humming along live for a couple of months and suddenly you find that one of your database tables contains duplicate records. You review the table structure and find that it does not have a unique constraint defined on a field (or group of fields) that should be unique. So how can these duplicates be eliminated?

I've encountered many solutions for joining a table to itself to identify and delete groups of duplicates, but on large (1GB+) tables this will cause a full table scan (unless you run it in batches) and could be very difficult on a production server. Another option is process the deletions through a temporary table and then rename the tables. So after you've eliminated the duplicates you still need to add the unique constraint to the table; wouldn't it be better if you could handle everything in one shot?

MySQL includes an option for that in the ALTER TABLE syntax; ALTER IGNORE TABLE. The IGNORE option works a bit like 'INSERT IGNORE INTO table1 SELECT * FROM table2...' in that the first record remains in the table but subsequent records are silently dropped.

Here's a "fun" example of how this all works:
mysql> create table tp_test(
a int(11) auto_increment primary key,
b varchar(25) not null);
Query OK, 0 rows affected (0.05 sec)

mysql> show create table tp_test;
+---------+-----------------------------------------------+
+---------+-----------------------------------------------+
| tp_test | CREATE TABLE `tp_test` (
`a` int(11) NOT NULL auto_increment,
`b` varchar(25) NOT NULL,
PRIMARY KEY (`a`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
+---------+-----------------------------------------------+
1 row in set (0.00 sec)

mysql> insert into tp_test (b)
values
('Chickens'),('eat'),('corn'),('corn'),
('and'),('other'),('grains'),('grains');
Query OK, 4 rows affected (0.00 sec)

mysql> select * from tp_test;
+---+----------+
| a | b |
+---+----------+
| 1 | Chickens |
| 2 | eat |
| 3 | corn |
| 5 | corn |
| 6 | and |
| 7 | other |
| 8 | grains |
| 9 | grains |
+---+----------+
8 rows in set (0.00 sec)

mysql> ALTER IGNORE TABLE tp_test
ADD UNIQUE INDEX `test1` (`b`);
Query OK, 8 rows affected (0.00 sec)
Records: 8 Duplicates: 2 Warnings: 0

mysql> select * from tp_test;
+---+----------+
| a | b |
+---+----------+
| 1 | Chickens |
| 2 | eat |
| 3 | corn |
| 6 | and |
| 7 | other |
| 8 | grains |
+---+----------+
6 rows in set (0.00 sec)