Efficient algorithm to find subgraph

Question

I have a really nasty problem (for me) at hand and I was wishing some of you may know an efficient algorithm to solve it. Thanks to all of you in advance.

My problem

I have a set of elements (that I would describe as a graph) that may be of three categories:

Provider nodes
Consumer nodes
Interconnections between nodes

Each node can have one or more interconnections tied to it.

My goal is to correlate each consumer to a set of providers to whom is connected, both directly through an interconnection or indirectly through multiple hops from one consumer to the other. Indirect interconnections can't go through a provider, only a consumer.

Our current approach is to analyze this graph starting with providers, hopping through interconnections recursively until a all unvisited consumers are hit. We then take this "island" of consumers bordered by a set of producers and assign these producers to each consumer we visited.

The approach gave us the right results but at a very high cost, in terms of time spent analyzing the graph.

Do you have any suggestion for an algorithm that can come in handy to solve this problem? I would also consider alternative data structures to hold our input (right now they are a set of tuples in a relational database describing the correlation between nodes and interconnections).

score 6 · Accepted Answer · answered Jul 09 '14 at 22:47

OK, it sounds like your problem is the following:

You have an undirected graph. Each vertex is either a consumer or a producer. Say that a consumer $c$ is connected to a producer $p$ if there is a path from $c$ to $p$ where all intermediate nodes are consumer nodes. You want to output a list of pairs $(c,p)$ where $c$ is connected to $p$.

This problem can be solved in a straightforward, clean, and efficient manner, using an algorithm for connected components. Here is the algorithm:

Remove all of the producer nodes, keeping only the consumer nodes and the edges from a consumer node to a consumer node.
Compute connected components on this smaller graph. Let the connected components be $C_1,\dots,C_k$.
Now, back in the original graph, shrink all of the consumer nodes in the same connected component into a single node. Thus, we get an edge from connected component $C_i$ to producer $p$ if there is any consumer $c$ in $C_i$ that has an edge to $p$.
This resulting graph now describes the connectivity relationship. In particular, for each consumer $c$ and each producer $p$, $c$ is connected to $p$ if and only $C_i$ has an edge to $p$, where $C_i$ is the connected component containing $c$.

For instance, to output the list of all pairs $(c,p)$ that are connected, you can do the following:
- For each connected component $C_i$, for each provider $p$ that has an edge from $C_i$ to $p$, for each consumer $c$ in $C_i$, output the pair $(c,p)$.

This algorithm runs in linear time. The running time to compute the connectivity graph (produced in steps 1-3) is $O(m+n)$, where $n$ is the number of nodes and $m$ is the number of edges (interconnections) in the original graph. The running time in step 4 could be as large as $\Theta(n^2)$, since there might be as many as $\Theta(n^2)$ such pairs. However, here's what we can say. If the correct output has length $\ell$ (i.e., there are $\ell$ pairs $(p,c)$ that are connected), then step 4 will have running time $O(\ell)$.

Therefore, the total running time is $O(m+n+\ell)$, i.e., linear in the sum of the size of the input and the size of the output. This is as fast as you could possibly hope for.

If you don't need to output the set of all pairs $(c,p)$ explicitly, but you just want a compact representation of this connectivity information, you can use the graph computed in steps 1-3 as that compact representation. Then the running time is $O(m+n)$.

Apiwat Chantawibul · Answer 2 · 2014-07-09T22:55:28.990

I can't tell if my answer will be applicable because the question is not precisely defined. But let me make some guesses here:

Input graphs with 2 types of vertex.

Provider node (P)
Consumer node (C)

The edges are constrained by:

(P) can connect to (C)
(C) can connect to (C)
(P) can not connect to (P) (or we just don't care if they're connected)

We say that a producer $p \in P$ can reach $c_* \in C$ denoted with $p \rightarrow c_*$ if:

there is a path starting from $p$ then going through customers only $c_1,c_2,\ldots,c_* \in C$ ending at $c_*$.

You might be interested in the questions:

Given a $p \in P$, find $C_p = \{c \mid c \in C \text{ and } p \rightarrow c\}$. (The set of consumers reachable from $p$)
Given a $c \in C$, find $P_c = \{p \mid p \in P \text{ and } p \rightarrow c\}$. (The set of providers that can provide for $c$)

EDIT: see D.W.'s answer for calculation.

score -1 · Answer 3 · answered Jul 09 '14 at 15:13

(am going to take a stab at this with the partial information at this point & possibly modify it on further clarification by questioner.) the Djikstra shortest path algorithm (which basically adds the shortest weighted new edges in the "frontier set") can be run in parallel across multiple starting nodes (but havent seen a ref to this). it sounds like that would give you close to what you want. the starting nodes in your case are the providers. as the edge/vertex set expands, this will eventually "touch" all consumers, and the shortest path to any provider from any consumer can be calculated.

Efficient algorithm to find subgraph

My problem

3 Answers3