PostgreSQL select distinct JSONB poor performance

Question

Let's say I have the following PostgreSQL table called products which has millions of records:

CREATE TABLE IF NOT EXISTS mytable (
    id serial NOT NULL PRIMARY KEY,
    label VARCHAR(50) NOT NULL,
    info jsonb NOT NULL,
    created_at timestamp NOT NULL DEFAULT now()
);

I have a SELECT statement targeting this table with several JOIN's that is generating duplicate rows. This appears to be common: Why do multiple-table joins produce duplicate rows?

I know I can fix this issue using SELECT DISTINCT .... However, the query is taking several seconds, whereas the vanilla SELECT ... query takes milliseconds.

I presume this has to do with the info JSONB field, which can be very large. When I remove info from the DISTINCT calculation by using SELECT DISTINCT ON (id) ... then the query is much faster.

However, DISTINCT ON breaks some of my queries that use ORDER BY [non-id field] due to this condition:

SELECT DISTINCT ON expressions must match initial ORDER BY expressions

I've noticed I can fix the error by using subqueries (good example here):

SELECT * FROM (
    SELECT DISTINCT ON (id) ...
) ORDER BY [non-id field]

Two questions:

Is passing a JSONB field through DISTINCT a known performance problem? I want to make sure my theory is reasonable.
Is my solution of using subqueries a good solution for fixing the SELECT DISTINCT ON expressions must match initial ORDER BY expressions error? Or is there a better solution I'm not thinking of?

I have added very specific questions [answering your previous question](https://stackoverflow.com/a/74426829/939860). Addressing them would help this new question about performance. Plus (always!) your version of Postgres. — Erwin Brandstetter, Nov 14 '22 at 05:22

score 1 · Answer 1 · answered Nov 14 '22 at 05:19

Is passing a JSONB field through DISTINCT a known performance problem?

Passing any large column through DISTINCT is a known performance problem. jsonb isn't special in this regard.

Or is there a better solution I'm not thinking of?

That depends. You didn't provide requested information, yet.

score 0 · Answer 2 · answered Nov 14 '22 at 07:12

0

You can get by with a single sort and still have good performance:

SELECT DISTINCT ON ([non-id column], id) ...
ORDER BY [non-id column], id;

answered Nov 14 '22 at 07:12

Laurenz Albe

209,280
17
206
263

PostgreSQL select distinct JSONB poor performance

2 Answers2