Let's say I have the following PostgreSQL table called products which has millions of records:
CREATE TABLE IF NOT EXISTS mytable (
id serial NOT NULL PRIMARY KEY,
label VARCHAR(50) NOT NULL,
info jsonb NOT NULL,
created_at timestamp NOT NULL DEFAULT now()
);
I have a SELECT statement targeting this table with several JOIN's that is generating duplicate rows. This appears to be common: Why do multiple-table joins produce duplicate rows?
I know I can fix this issue using SELECT DISTINCT .... However, the query is taking several seconds, whereas the vanilla SELECT ... query takes milliseconds.
I presume this has to do with the info JSONB field, which can be very large. When I remove info from the DISTINCT calculation by using SELECT DISTINCT ON (id) ... then the query is much faster.
However, DISTINCT ON breaks some of my queries that use ORDER BY [non-id field] due to this condition:
SELECT DISTINCT ON expressions must match initial ORDER BY expressions
I've noticed I can fix the error by using subqueries (good example here):
SELECT * FROM (
SELECT DISTINCT ON (id) ...
) ORDER BY [non-id field]
Two questions:
- Is passing a JSONB field through
DISTINCTa known performance problem? I want to make sure my theory is reasonable. - Is my solution of using subqueries a good solution for fixing the
SELECT DISTINCT ON expressions must match initial ORDER BY expressionserror? Or is there a better solution I'm not thinking of?