Objectives
I would like to run pgr_dijkstra for each pair of vertices in a routable graph. I have a simple script that takes each pair of vertices and launches a pgr_dijkstra against the database.
It would be nice to use PostgreSQL parallel query capability to speed it up.
I would like to achieve an "easy" speed up without a need to implement advanced code like contractions.
Problem
Although I can see that for some of my tables explain analyze shows a plan that involves Gather and Parallel Seq Scan, trying to run pgr_dijkstra with "parallel" parameters enabled and with the PARALLEL SAFE keywords actually makes it slower.
Input data
I have a PostGIS-enabled database with some OSM data imported with osm2pgsql:
| Data | Size |
|---|---|
| Nodes (vertices) table | 730 vertices |
| Nodes (vertices) table | 160 kB |
| Edges table | 1100 vertices |
| Edges table | 656 kB |
PostgreSQL and system configuration
CPU: 11th Gen Intel Core i7-1165G7
| PostgreSQL parameter | Value |
|---|---|
min_parallel_table_scan_size |
8kB |
force_parallel_mode |
off |
parallel_setup_cost |
10 |
parallel_tuple_cost |
0.001 |
max_parallel_workers_per_gather |
8 |
max_parallel_workers |
8 |
max_worker_processes |
8 |
I have set force_parallel_mode to off as per this recommendation, but even when on it does not make any difference.
Changing parallel_tuple_cost from the default value to 0.001 does not seem to help.
Queries and results
I have set up a very simple pgr_dijkstra call with and without SAFE, with origin/destination vertices' ids hardcoded:
CREATE OR REPLACE
FUNCTION pgr_parallel()
RETURNS VOID
AS $$
BEGIN
PERFORM pgr_dijkstra(
'
SELECT gid AS id,
source,
target,
length_m AS cost
FROM edges_table
WHERE
NOT source IS NULL
AND
NOT target IS NULL
AND
NOT the_geom IS NULL
AND NOT length_m IS NULL
',
ARRAY[161]
,
ARRAY[639]
,
directed := false
);
END;
$$ LANGUAGE 'plpgsql'
PARALLEL SAFE;
I have a simple script that calls this function N times, and it does not seem that it runs in a parallel way:
| Type of function | Number (N) of pairs of vertices | Execution time |
|---|---|---|
| undefined (not parallel) | 50,000 | 67 seconds |
PARALLEL SAFE |
50,000 | 89 seconds |
As an additional observation, I run htop and see that in the beginning, when the (parallel) script runs, all 8 physical cores are being used. But when the script finishes firing "tasks" to the database, htop shows that only one core gets used by the postgres process.
Questions
If I understand correctly, a query can be 'parallelized' if there is some significant independent work to be done by each worker. In case of pgr_dijkstra and any other shortest path-finding algorithm, isn't this the case? I would suppose that pgr_dijkstra fetches vertices and edges and then runs an independent computation on them.
Is it possible to use the parallel query capability of PostgreSQL to make at least some parts of pgr_dijkstra to run in parallel? If yes, how?
Note: In this post by Crunchy Data and Paul Ramsey, PARALLEL SAFE is used for a function that contains pgr_dijkstra, but there is no detail if it runs faster thanks to that.