I am trying to query a total of 46M rows and I am still not sure which index to use for this data. Here is what my table looks like:
CREATE TABLE IF NOT EXISTS "data" (
"deviceId" UUID,
"eventDate" DATE,
"eventHour" SMALLINT,
"eventTime" TIME(0) WITHOUT TIME ZONE,
"point" GEOMETRY(POINT, 4326),
"city_code" INTEGER,
"county_code" INTEGER,
"district_code" INTEGER,
"duration" INTEGER,
"deviceType" TEXT,
"cell" H3INDEX,
"yas" SMALLINT,
"ses" SMALLINT,
PRIMARY KEY ("deviceId", "eventDate", "eventTime")
);
I do not need to filter rows by device ID; I mainly need to filter rows by cell. cell represents which H3 cell each point falls in (the cells render as hexagons on a map).
Cells are basically a text data type with fixed-length values such as 8c2d1c68a2d07ff.
Basically, I need to group the rows (points) by cell and filter them by yas, ses and eventTime (on an hourly basis and between date ranges). yas and ses are categorical data: the integers are limited to 1-10 and represent different categories. I have tried the indexes below, but the query takes more than 1 second to compute over 46M rows, and there will eventually be 10x more rows than that:
CREATE INDEX IF NOT EXISTS "data_cell" ON "data" ("cell") WITH (FILLFACTOR = 100);
CREATE INDEX IF NOT EXISTS "data_ses_cell" ON "data" ("ses", "cell") WITH (FILLFACTOR = 100);
CREATE INDEX IF NOT EXISTS "data_yas_cell" ON "data" ("yas", "cell") WITH (FILLFACTOR = 100);
CREATE INDEX IF NOT EXISTS "data_date_cell" ON "data" ("eventDate", "cell") WITH (FILLFACTOR = 100);
CREATE INDEX IF NOT EXISTS "data_date_cell" ON "data" ("eventTime", "cell") WITH (FILLFACTOR = 100);
CREATE INDEX IF NOT EXISTS "data_date_cell" ON "data" ("eventHour", "cell") WITH (FILLFACTOR = 100);
CREATE INDEX IF NOT EXISTS "data_date_time_cell" ON "data" ("eventHour", "eventDate", "cell") WITH (FILLFACTOR = 100);
And here is a sample query with its query plan:
EXPLAIN ANALYZE
WITH ref AS (
    SELECT refcell, h3_cell_to_children(refcell, 12) node
    FROM (
        SELECT h3_polygon_to_cells(ST_SetSRID(
            ST_MakeBox2D(
                ST_Point(28.93155097961426, 40.97555652808213),
                ST_Point(29.058237075805668, 41.029513890837386)
            ), 4326
        ), 8) refcell
    ) as cells
), filtered AS (
    SELECT cell, count(*)
    FROM data
    WHERE
        "cell" IN (SELECT node FROM ref) AND
        "eventDate" BETWEEN '2023-01-01' AND '2023-02-01' AND
        "ses" = ANY(ARRAY[0]) AND
        "yas" = ANY(ARRAY[0,1,2])
    GROUP BY cell
)
SELECT refcell, sum(count)
FROM (
    SELECT refcell, node, count
    FROM ref, filtered
    WHERE cell = ref.node
) as t
GROUP BY refcell;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=500785.48..500787.98 rows=200 width=40) (actual time=1322.053..1322.064 rows=60 loops=1)
Group Key: ref.refcell
Batches: 1 Memory Usage: 48kB
CTE ref
-> ProjectSet (cost=0.00..5022.77 rows=1000000 width=16) (actual time=173.051..184.126 rows=187278 loops=1)
-> ProjectSet (cost=0.00..5.27 rows=1000 width=8) (actual time=173.043..173.108 rows=78 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=172.681..172.682 rows=1 loops=1)
-> Hash Join (cost=159082.09..181762.72 rows=62800000 width=16) (actual time=1297.852..1321.680 rows=3705 loops=1)
Hash Cond: (ref.node = filtered.cell)
-> CTE Scan on ref (cost=0.00..20000.00 rows=1000000 width=16) (actual time=173.053..186.193 rows=187278 loops=1)
-> Hash (cost=158925.09..158925.09 rows=12560 width=16) (actual time=1124.736..1124.737 rows=3705 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 302kB
-> Subquery Scan on filtered (cost=158673.89..158925.09 rows=12560 width=16) (actual time=1123.696..1124.310 rows=3705 loops=1)
-> HashAggregate (cost=158673.89..158799.49 rows=12560 width=16) (actual time=1123.694..1124.116 rows=3705 loops=1)
Group Key: data.cell
Batches: 1 Memory Usage: 913kB
-> Nested Loop (cost=22500.56..156950.74 rows=344630 width=8) (actual time=91.799..1111.033 rows=91725 loops=1)
-> HashAggregate (cost=22500.00..22502.00 rows=200 width=8) (actual time=91.473..137.551 rows=187278 loops=1)
Group Key: ref_1.node
Batches: 5 Memory Usage: 11073kB Disk Usage: 3472kB
-> CTE Scan on ref ref_1 (cost=0.00..20000.00 rows=1000000 width=8) (actual time=0.001..42.754 rows=187278 loops=1)
-> Index Scan using data_ses_cell on data (cost=0.56..671.69 rows=55 width=8) (actual time=0.001..0.005 rows=0 loops=187278)
Index Cond: ((ses = ANY ('{0}'::integer[])) AND (cell = ref_1.node))
Filter: (("eventDate" >= '2023-01-01'::date) AND ("eventDate" <= '2023-02-01'::date) AND (yas = ANY ('{0,1,2}'::integer[])))
Rows Removed by Filter: 1
Planning Time: 0.273 ms
JIT:
Functions: 41
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 2.609 ms, Inlining 7.226 ms, Optimization 107.961 ms, Emission 62.515 ms, Total 180.311 ms
Execution Time: 1325.916 ms
(31 rows)
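Reading the plan, nearly all the time goes into the 187,278 nested-loop index probes (one per res-12 child cell). An idea I have been toying with is probing by the res-8 parent instead, through an expression index, so there is one probe per refcell rather than per child. A sketch (the index name is made up; h3_cell_to_parent comes from the h3 extension):

-- Hypothetical expression index keyed on each point's res-8 parent cell
CREATE INDEX IF NOT EXISTS "data_parent8_idx"
ON "data" (h3_cell_to_parent("cell", 8), "ses", "yas", "eventDate");

-- The children expansion then becomes unnecessary: match the ~78 refcells directly
SELECT h3_cell_to_parent("cell", 8) AS refcell, count(*)
FROM "data"
WHERE h3_cell_to_parent("cell", 8) IN (SELECT refcell FROM ref)
  AND "eventDate" BETWEEN '2023-01-01' AND '2023-02-01'
  AND "ses" = ANY(ARRAY[0])
  AND "yas" = ANY(ARRAY[0,1,2])
GROUP BY 1;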
Server specs: 64GB RAM, a Ryzen 5 3600 and a 512GB NVMe SSD.
I need to execute this kind of query in under 500ms. Is that possible?
Should I be using PostgreSQL in the first place for this many rows? Or, if PostgreSQL can handle this many data points, what am I doing wrong?
Thanks!
EDIT: I have changed the structure a bit and this is the final state:
New indexes:
CREATE INDEX IF NOT EXISTS "data_ses_cell" ON "data" ("cell", "ses", "yas") WITH (FILLFACTOR = 100);
CREATE INDEX IF NOT EXISTS "data_date_time_cell" ON "data" USING BRIN ("eventTime") WITH (FILLFACTOR = 100);
EXPLAIN (ANALYZE, BUFFERS)
WITH ref AS (
    SELECT refcell, h3_cell_to_children(refcell, 12) node
    FROM (
        SELECT h3_polygon_to_cells(ST_SetSRID(
            ST_MakeBox2D(
                ST_Point(28.87567520141602, 40.95903013727966),
                ST_Point(29.12904739379883, 41.06692773019345)
            ), 4326
        ), 9) refcell
    ) as cells
), filtered AS (
    SELECT cell, count(*)
    FROM data
    WHERE
        "cell" IN (SELECT node FROM ref) AND
        "eventTime"::DATE BETWEEN '2023-01-01' AND '2023-01-01' AND
        "ses" = ANY(ARRAY[0]) AND
        "yas" = ANY(ARRAY[1,2,3])
    GROUP BY cell
)
SELECT refcell, sum(count)
FROM (
    SELECT refcell, node, count
    FROM ref, filtered
    WHERE cell = ref.node
) as t
GROUP BY refcell;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=144023.36..144025.86 rows=200 width=40) (actual time=4356.905..4356.967 rows=431 loops=1)
Group Key: ref.refcell
Batches: 1 Memory Usage: 173kB
Buffers: shared hit=8999147 read=304826 dirtied=32 written=11724, temp read=6321 written=8197
I/O Timings: shared/local read=476.680 write=29.652, temp read=7.378 write=24.451
CTE ref
-> ProjectSet (cost=0.00..5022.77 rows=1000000 width=16) (actual time=1.766..48.559 rows=750827 loops=1)
-> ProjectSet (cost=0.00..5.27 rows=1000 width=8) (actual time=1.764..2.680 rows=2189 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.001 rows=1 loops=1)
-> Hash Join (cost=72494.97..95175.59 rows=8765000 width=16) (actual time=4245.857..4356.724 rows=858 loops=1)
Hash Cond: (ref.node = filtered.cell)
Buffers: shared hit=8999147 read=304826 dirtied=32 written=11724, temp read=6321 written=8197
I/O Timings: shared/local read=476.680 write=29.652, temp read=7.378 write=24.451
-> CTE Scan on ref (cost=0.00..20000.00 rows=1000000 width=16) (actual time=1.767..54.292 rows=750827 loops=1)
Buffers: temp read=2383 written=1
I/O Timings: temp read=2.751 write=0.028
-> Hash (cost=72473.06..72473.06 rows=1753 width=16) (actual time=4244.035..4244.037 rows=858 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 57kB
Buffers: shared hit=8999147 read=304826 dirtied=32 written=11724, temp read=3938 written=8196
I/O Timings: shared/local read=476.680 write=29.652, temp read=4.627 write=24.423
-> Subquery Scan on filtered (cost=72424.85..72473.06 rows=1753 width=16) (actual time=4243.319..4243.933 rows=858 loops=1)
Buffers: shared hit=8999147 read=304826 dirtied=32 written=11724, temp read=3938 written=8196
I/O Timings: shared/local read=476.680 write=29.652, temp read=4.627 write=24.423
-> GroupAggregate (cost=72424.85..72455.53 rows=1753 width=16) (actual time=4243.318..4243.873 rows=858 loops=1)
Group Key: data.cell
Buffers: shared hit=8999147 read=304826 dirtied=32 written=11724, temp read=3938 written=8196
I/O Timings: shared/local read=476.680 write=29.652, temp read=4.627 write=24.423
-> Sort (cost=72424.85..72429.23 rows=1753 width=8) (actual time=4243.313..4243.460 rows=4456 loops=1)
Sort Key: data.cell
Sort Method: quicksort Memory: 193kB
Buffers: shared hit=8999147 read=304826 dirtied=32 written=11724, temp read=3938 written=8196
I/O Timings: shared/local read=476.680 write=29.652, temp read=4.627 write=24.423
-> Nested Loop (cost=22500.56..72330.40 rows=1753 width=8) (actual time=330.563..4242.929 rows=4456 loops=1)
Buffers: shared hit=8999147 read=304826 dirtied=32 written=11724, temp read=3938 written=8196
I/O Timings: shared/local read=476.680 write=29.652, temp read=4.627 write=24.423
-> HashAggregate (cost=22500.00..22502.00 rows=200 width=8) (actual time=326.702..602.036 rows=750827 loops=1)
Group Key: ref_1.node
Batches: 21 Memory Usage: 11073kB Disk Usage: 19600kB
Buffers: temp read=3938 written=8196
I/O Timings: temp read=4.627 write=24.423
-> CTE Scan on ref ref_1 (cost=0.00..20000.00 rows=1000000 width=8) (actual time=0.000..183.769 rows=750827 loops=1)
Buffers: temp written=2382
I/O Timings: temp write=9.866
-> Index Scan using data_ses_cell on data (cost=0.56..249.13 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=750827)
Index Cond: ((cell = ref_1.node) AND (ses = ANY ('{0}'::integer[])) AND (yas = ANY ('{1,2,3}'::integer[])))
Filter: ((("eventTime")::date >= '2023-01-01'::date) AND (("eventTime")::date <= '2023-01-01'::date))
Rows Removed by Filter: 0
Buffers: shared hit=8999147 read=304826 dirtied=32 written=11724
I/O Timings: shared/local read=476.680 write=29.652
Planning:
Buffers: shared hit=29
Planning Time: 0.305 ms
Execution Time: 4361.263 ms
(53 rows)
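One thing I notice in this plan is that the ::DATE cast runs as a per-row filter inside every loop, and it also keeps the BRIN index on "eventTime" from being used. Assuming "eventTime" now holds a timestamp (as the cast in the plan suggests), the same one-day window can be written as a sargable half-open range on the raw column:

-- Equivalent to "eventTime"::DATE BETWEEN '2023-01-01' AND '2023-01-01',
-- but usable by an index on "eventTime" itself
WHERE "cell" IN (SELECT node FROM ref)
  AND "eventTime" >= '2023-01-01'
  AND "eventTime" < '2023-01-02'
  AND "ses" = ANY(ARRAY[0])
  AND "yas" = ANY(ARRAY[1,2,3])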
EDIT 2
We have added a month of data and there are now 200M rows in total.
I changed the table and the indexes a bit:
CREATE TABLE IF NOT EXISTS "data" (
device_id UUID, -- Indicates the device ID
event_time TIMESTAMP WITH TIME ZONE, -- Event Time
city_code INTEGER, -- City Code
county_code INTEGER, -- County Code
district_code INTEGER, -- District Code
-- point GEOMETRY(POINT, 4326), -- Point (lon, lat)
duration INTEGER, -- Duration (seconds)
num_pings SMALLINT, -- Number of pings
cell TEXT, -- Cell (H3)
yas SMALLINT, -- Age
ses SMALLINT -- Socio-Economic Status
);
CREATE INDEX IF NOT EXISTS "data_event_time_idx" ON "data" (event_time);
CREATE INDEX IF NOT EXISTS "data_event_time_idx" ON "data" using brin (event_time);
CREATE INDEX IF NOT EXISTS "data_cell_idx" ON "data" (cell);
CREATE INDEX IF NOT EXISTS "data_ses_yas_event_time_cell_idx" ON "data" (ses, yas, event_time, cell);
Current status:
EXPLAIN (ANALYZE, BUFFERS)
WITH ref AS (
    SELECT refcell::text, h3_cell_to_children(refcell, 12)::text node
    FROM (
        SELECT h3_polygon_to_cells(ST_SetSRID(
            ST_MakeBox2D(
                ST_Point(28.96532535552979, 41.01392390972017),
                ST_Point(28.981161117553714, 41.020667069908704)
            ), 4326
        ), 12) refcell
    ) as cells
), filtered AS (
    SELECT cell, count(*)
    FROM data
    WHERE
        "cell" IN (SELECT node FROM ref) AND
        "event_time" BETWEEN '2023-01-01' AND '2023-02-01' AND
        "ses" = ANY(ARRAY[0]) AND
        "yas" = ANY(ARRAY[0,1,2])
    GROUP BY cell
)
SELECT refcell, sum(count)
FROM (
    SELECT refcell, node, count
    FROM ref, filtered
    WHERE cell = ref.node
) as t
GROUP BY refcell;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=4697558.19..4697560.69 rows=200 width=64) (actual time=5569.104..5573.877 rows=14 loops=1)
Group Key: ref.refcell
Batches: 1 Memory Usage: 40kB
Buffers: shared hit=15515 read=2288426
CTE ref
-> Result (cost=0.00..27522.76 rows=1000000 width=64) (actual time=197.172..199.926 rows=2936 loops=1)
-> ProjectSet (cost=0.00..5022.77 rows=1000000 width=16) (actual time=197.167..198.989 rows=2936 loops=1)
-> ProjectSet (cost=0.00..5.27 rows=1000 width=8) (actual time=197.164..198.240 rows=2936 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=194.764..194.765 rows=1 loops=1)
-> Hash Join (cost=4102904.80..4125585.42 rows=108890000 width=40) (actual time=5568.535..5573.854 rows=14 loops=1)
Hash Cond: (ref.node = filtered.cell)
Buffers: shared hit=15515 read=2288426
-> CTE Scan on ref (cost=0.00..20000.00 rows=1000000 width=64) (actual time=197.173..197.334 rows=2936 loops=1)
-> Hash (cost=4102632.57..4102632.57 rows=21778 width=24) (actual time=5371.307..5376.075 rows=14 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 257kB
Buffers: shared hit=15515 read=2288426
-> Subquery Scan on filtered (cost=4102197.01..4102632.57 rows=21778 width=24) (actual time=5371.209..5376.066 rows=14 loops=1)
Buffers: shared hit=15515 read=2288426
-> HashAggregate (cost=4102197.01..4102414.79 rows=21778 width=24) (actual time=5371.207..5376.062 rows=14 loops=1)
Group Key: data.cell
Batches: 1 Memory Usage: 793kB
Buffers: shared hit=15515 read=2288426
-> Hash Join (cost=23504.50..4100271.86 rows=385030 width=16) (actual time=2286.179..5368.528 rows=49557 loops=1)
Hash Cond: (data.cell = ref_1.node)
Buffers: shared hit=15515 read=2288426
-> Gather (cost=1000.00..4071462.50 rows=770059 width=16) (actual time=2281.522..5285.847 rows=753398 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15515 read=2288426
-> Parallel Seq Scan on data (cost=0.00..3993456.60 rows=320858 width=16) (actual time=2258.752..5282.661 rows=251133 loops=3)
Filter: ((ses = ANY ('{0}'::integer[])) AND (event_time >= '2023-01-01 00:00:00+00'::timestamp with time zone) AND (event_time <= '2023-02-01 00:00:00+00'::timestamp with time zone) AND (yas = ANY ('{0,1,2}'::integer[])))
Rows Removed by Filter: 67329851
Buffers: shared hit=15515 read=2288426
-> Hash (cost=22502.00..22502.00 rows=200 width=32) (actual time=4.643..4.644 rows=2936 loops=1)
Buckets: 4096 (originally 1024) Batches: 1 (originally 1) Memory Usage: 170kB
-> HashAggregate (cost=22500.00..22502.00 rows=200 width=32) (actual time=4.071..4.289 rows=2936 loops=1)
Group Key: ref_1.node
Batches: 1 Memory Usage: 385kB
-> CTE Scan on ref ref_1 (cost=0.00..20000.00 rows=1000000 width=32) (actual time=0.002..3.427 rows=2936 loops=1)
Planning Time: 0.262 ms
JIT:
Functions: 51
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 3.719 ms, Inlining 73.689 ms, Optimization 176.614 ms, Emission 109.112 ms, Total 363.134 ms
Execution Time: 5576.688 ms
(45 rows)
5 seconds does not sound like much considering it is 200M rows, but I am still trying to figure out a way to make it faster. Maybe some intermediary table would be good too, but I still need to be able to filter by the yas and ses data, so I am not sure how to structure it.
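For the intermediary table, the shape I am considering is a rollup that keeps ses and yas as grouping keys, so they stay filterable, and pre-counts per cell and day. A sketch (data_daily_rollup and its index are made-up names; it would need to be refreshed as new data arrives):

-- Pre-aggregate per (cell, ses, yas, day); ses/yas stay filterable because
-- they remain grouping keys, and per-category counts can be summed at query time
CREATE MATERIALIZED VIEW data_daily_rollup AS
SELECT cell, ses, yas, event_time::date AS event_day, count(*) AS n
FROM data
GROUP BY cell, ses, yas, event_time::date;

CREATE INDEX data_daily_rollup_idx
ON data_daily_rollup (cell, ses, yas, event_day);

-- Queries then filter on the retained keys and sum the pre-computed counts
SELECT cell, sum(n)
FROM data_daily_rollup
WHERE cell IN (SELECT node FROM ref)
  AND event_day BETWEEN '2023-01-01' AND '2023-02-01'
  AND ses = ANY(ARRAY[0])
  AND yas = ANY(ARRAY[0,1,2])
GROUP BY cell;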