I am often asked if it makes sense to place an index on a Boolean field in order to improve query performance. In general, because a boolean value can only have three values (True, False, Null), this low cardinality would suggest that adding an index will not help performance, as the query optimizer will still usually perform a table-scan if you have an even distribution of values within your DB.
One situation in which an index on a boolean field (or other low cardinality field) might be useful is if there are relatively few of one of the values, for example 5 True values in a table of millions of records and you are searching for those few values on a regular basis.
However, you might index a boolean value on a combination of fields. Indexing on a single Boolean might be pointless, because there’s only 2 (or 3) values. However, indexing on 16 boolean values has the potential of 2^16 values. It might help to make a combined index but you should understand how the combined index can and cannot be used and be aware that the order of the columns matters.
In general, you should always profile your system to see if there are queries that are too slow and consider adding another index to handle those queries. Sometimes a single combined index can be used for multiple queries and others time you will need to make an index for each type of query. Remember that adding indexes slows down modifications to the data so it is possible to have too many indexes. There is always a trade-off when creating multiple indexes.
Using ALTER IGNORE TABLE [TABLENAME] ADD UNIQUE INDEX `UNIQUE_INDEX` ([FIELDNAME]) to remove duplicate rows in a table is a fast an efficient process, however on large tables where the physical size is larger than server memory, the ALTER statement can take a long time to run in a production environment.
If you need to remove duplicates on a very large table (we recently used this on a table of 77 million rows), try this method :
delete t1 from table t1, table t2
where t1.duplicate_field= t2.duplicate_field (add more if need ie. and t1.duplicate_field2=t2.duplicate_field2)
and t1.unique_field > t2.unique_field
and breakup into ranges to run faster
If you use an auto-incrementing ID field as the primary key, use this as your unique field, and in the Where clause to run on a range of records to break into smaller operations.
I recently ran into a situation where a Delete across a large time period was taking an extensive amount of time to run when replicated to the Slave, although the query ran very quickly when running on the Master. On the Master, the query took
After much head scratching and verifying that indexes existed on both Master and Slave, we took a few minutes to compare the query execution plan using EXPLAIN. It was quickly obvious that the Slave was using a different index than the Master when executing the query. We quickly update the table statistics using Analyze Table, and re-ran EXPLAIN. After the Analyze Table, the problem was solved, and after restarting replication using Start Slave, the offending query quickly executed in 2 minutes.
Moral of the Story: Running Analyze Table on larger tables every couple days is probably a good idea.
The fastest and easiest way to delete duplicate records is my issuing a very simple command.
alter ignore table [tablename] add unique index `unique_index` ([fieldname])
What this does is create a unique index on the field that you do not want to have any duplicates. The ignore syntax instructs MySQL to not stop and display an error when it hits a duplicate. This is much easier than dumping and reloading a table.
This also will work, but is not as elegant:
delete from [tablename] where fieldname in (select a.[fieldname] from
(select [fieldname] from [tablename] group by [fieldname] having count(*) > 1 ) a )