You're walking in the woods of your database
There's no one around, and your laptop battery is (almost) dead
Out of the corner of your eye, you spot them - duplicates
It starts bugging you, "How did they get there?"
You select all the data and scratch your head
But you start to see it's not just a few rows
There are duplicates everywhere!
(Heavily inspired by Rob Cantor's Shia LaBeouf )
You identify 1 or 2 columns where you can sort out those nasty repeated rows of information.
You only want to keep one of them, a
NOT NULL & not blank value.
PARTITION_BY!", I hear someone shouting from the imaginary crowd for this horror tale (it was actually Joana).
Except....You're using MySQL. *sigh*
The ONLY_FULL_GROUP_BY flag (side note)
If you're having issues with your current query for a similar problem, know that this little flag is enabled by default since MySQL 5.7.
It won't allow you to use columns in the
SELECT statement that are not included in the
GROUP_BY statement, except if those are functions (like
You could disable it and carry on with your life...
mysql > SET GLOBAL sql_mode=(SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));
But then again, why are you reading this?
In case you don't notice it during the following section, I'll be working with translations and I have repeated pairs of
translated_object_id - a foreign_key. I only need to keep 1 valid translation.
SELECT ANY_VALUE(id) id, translated_object_id, locale FROM translated_objects GROUP BY translated_object_id, locale HAVING COUNT(*) > 1;
My first attempt was invalid.
I searched about the ONLY_FULL_GROUP_BY flag as soon as I ran my query when I initially thought I could just get the ID field I needed as long as I grouped and counted the resulting rows.
As a good programmer, I turned to StackOverflow and promptly realised my mistake, so I added the
ANY_VALUE function because I was "sure that all the results inside the group are the same".
SELECT t0.id FROM translated_objects as t0 LEFT JOIN translated_objects AS t1 ON t0.translated_object_id = t1.translated_object_id AND t0.locale = t1.locale AND t1.id > t0.id WHERE t1.translated_object_id IS NOT NULL;
The next chapter was basically a re-iteration of the previous solution, less fancier but it was still getting the job done.
I was simply looking for an alternative to the
GROUP_BY query after my first failure, and when I saw that my previous solution did not provide all the results I needed, I had to give this query a shot.
Still, after digging further and finally understanding my problem, I knew this path would never lead me to shore, since I had 4 or 5 duplicated lines per
GROUP_BY and these queries only returned 1 of them.
I knew what I needed & I couldn't avoid it any longer. (Joana was right...)
Partition By w/ 1 column (explanation)
SELECT * FROM translated_objects WHERE id IN ( SELECT id FROM ( SELECT @row_number := IF(@previous_value=o.translated_object_id,@row_number+1,1) AS RowNumber ,o.id ,o.translated_object_id ,o.locale ,@previous_value := o.translated_object_id FROM translated_objects o, (SELECT @row_number := 1) x, (SELECT @previous_value := '') y ORDER BY o.translated_object_id, o.updated_at DESC ) subquery WHERE RowNumber > 1 );
If you checked the previous queries, you will notice this wasn't actually a solution to my problem since I need to take into consideration 2 columns. For completeness sake, and for posterity, I'll leave a partition query with only 1 column.
MySQL does not have a
ROW_NUMBER function like other database softwares do, which allows to differentiate between grouped results by conveniently assigning them - you guessed it - a
row_number. However, it does have the ability to use variables relatively freely, so I only need to create my own.
Furthermore, I also need to keep track of the partition so I know when to reset the
row_number. That's why I use the
previous_value, carefully checking when my foreign key changes with the
IF statement you see on the subquery.
Enough chit-chat. Let's move on!
Partition By w/ 2 columns (or more...)
SELECT * FROM translated_objects WHERE id IN ( SELECT id FROM ( SELECT @row_number := IF(@previous_value=concat_ws('',o.translated_object_id, o.locale),@row_number+1,1) AS RowNumber ,o.id ,o.translated_object_id ,o.locale ,@previous_value := concat_ws('',o.translated_object_id, o.locale) FROM translated_objects o, (SELECT @row_number := 1) x, (SELECT @previous_value := '') y ORDER BY o.translated_object_id, o.locale, (CASE WHEN o.name IS NULL THEN 0 ELSE 1 END) DESC, o.updated_at ASC ) subquery WHERE RowNumber > 1 );
Sweet Christmas! Finally got it!
Like stated before, I needed to
PARTITION BY 2 columns - my foreign_key
locale - so I have to
CONCAT these values.
CONCAT_WS here since unlike your regular brand of concatenation, this method will actually skip any NULL values and return the resulting (expected) string. This is now my main control value, so it updates my
Finally, since I want to
ORDER BY the latest
NOT NULL value, I added an explicit
CASE to the statement.
SELECT * FROM translated_objects WHERE id IN ( SELECT id FROM ( SELECT @row_number := IF(@previous_value = CONCAT_WS('', o.translated_object_id, o.locale), @row_number + 1 ,1) AS RowNumber ,o.id ,@previous_value := CONCAT_WS('', o.translated_object_id, o.locale) FROM ( SELECT translated_objects.id, translated_objects.translated_object_id, translated_objects.locale, translated_objects.name, translated_objects.updated_at FROM translated_objects, ( SELECT translated_object_id, locale, COUNT(*) AS count FROM translated_objects GROUP BY translated_object_id, locale HAVING COUNT(*) > 1 ) AS repeated WHERE translated_objects.translated_object_id = repeated.translated_object_id AND translated_objects.locale = repeated.locale ) o, (SELECT @row_number := 1) x, (SELECT @previous_value := '') y ORDER BY o.translated_object_id, o.locale, (CASE WHEN o.name IS NULL OR translated_objects.name = '' THEN 0 ELSE 1 END) DESC, o.updated_at ASC ) subquery WHERE RowNumber > 1 );
You might not notice it, but the
CASE in the
ORDER BY line was updated. If, like me, you had a lot of empty non-null lines in your rows, you are going to need to filter those out too if you really want to obtain the last valid result.
However, "empty" might be a valid result for you, so I decided to only include it after for the article's purpose.
Now, when working with large tables, I was still waiting ~5 minutes before I could get any result. I was accessing the whole table with
FROM translated_objects o, which probably loads the dataset to memory to work on it when the partitions extend for a large number of rows.
Therefore, to reduce the amount of data needed, search no further then the very first attempt in the article, with a small twist (thanks Filipe!).
SELECT translated_object_id, locale, COUNT(*) AS count FROM translated_objects GROUP BY translated_object_id, HAVING COUNT(*) > 1
Instead of fetching
ANY_VALUE from the table, simply count the number of results from the
GROUP BY and return the parsed set of data. Furthermore, simply select the columns you absolutely need in order to work with the minimum amount of data required.
CREATE INDEX idx ON translated_objects (translated_object_id, locale);
In order to accelerate the
subsubsubsubquery, add an index on both columns from
GROUP BY translated_object_id, locale to make it even faster.
CREATE UNIQUE INDEX idx ON translated_objects (translated_object_id, locale);
Add an unique index on both columns from
GROUP BY translated_object_id, locale after to avoid any future errors.
You're finally safe...from those nasty duplicates.