
I want to iterate through the first $k$ elements of a randomly ordered list containing all subtrees for a given tree. The definition of subtrees that I'm using is: "A subtree of $T$ is a subgraph of $T$ that is also a tree". For example, a tree $A(B(D, E, F), C(G, H))$ should have 62 subtrees if my math is correct.

The number of subtrees can be calculated as follows. Let $R(T)$ denote the number of subtrees of $T$ that contain the root of $T$. If $T$ is a single leaf then $R(T) = 1$, and if the root of $T$ has children whose subtrees are $T_1,\ldots,T_\ell$ then $$ R(T) = \prod_{i=1}^\ell (R(T_i) + 1). $$ Finally, the number of subtrees of $T$ is $\sum_{x \in T} R(T_x)$, where $x$ goes over all vertices of $T$, and $T_x$ is the subtree induced by $x$ (consisting of $x$ and all of its descendants).
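As a quick check of the count for the example tree, here is a minimal Python sketch of the recurrence. The dictionary representation (`TREE`, mapping a vertex to its list of children) and the function name `rooted_counts` are assumptions made for illustration only.

```python
from math import prod

# Hypothetical representation: each internal vertex maps to its children;
# leaves simply do not appear as keys.
TREE = {"A": ["B", "C"], "B": ["D", "E", "F"], "C": ["G", "H"]}

def rooted_counts(children, root):
    """Return {x: R(T_x)} for every vertex x of the tree below root."""
    # Collect vertices so that every parent precedes its descendants.
    order, stack = [], [root]
    while stack:
        x = stack.pop()
        order.append(x)
        stack.extend(children.get(x, []))
    counts = {}
    # Process in reverse, so children are counted before their parent.
    for x in reversed(order):
        # An empty product is 1, which covers the leaf case.
        counts[x] = prod(counts[c] + 1 for c in children.get(x, []))
    return counts

R = rooted_counts(TREE, "A")
N = sum(R.values())
print(R["A"], R["B"], R["C"], N)  # 45 8 4 62
```

The total of 62 matches the figure quoted for $A(B(D, E, F), C(G, H))$: 45 subtrees rooted at $A$, 8 at $B$, 4 at $C$, and one for each of the five leaves.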

By shuffling the order of these trees and listing the first 5, the result might be $C(G), A(B(F)), B(D, E), A(B, C), C$.

I can already generate a full list of subtrees. For larger trees, however, this list quickly becomes too large to build. I am looking for a way of generating these subtrees without having to generate all of them.

One possible method I've thought of is to keep a count of how many unseen rooted subtrees each node has left, and to decrement it each time a subtree is generated from that node. The probability that a node is selected would be weighted by this counter. I would also have to prevent the generation of duplicates. I was wondering if there is a better way.

Yuval Filmus
anlew

1 Answer

If $k \leq (1-\epsilon) N$, where $N$ is the total number of subtrees, then the following approach would work:

  1. Start with the empty list $L$.

  2. Repeat $k$ times:

    • Pick a random subtree $T'$.
    • If $T' \notin L$, add it to $L$; otherwise, go back to the previous step.
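The loop above can be sketched in Python as follows. This is only an illustration: `sample_distinct` and `draw` are hypothetical names, and `draw` stands in for whatever routine returns a uniformly random subtree in a hashable form (here it is replaced by a random integer so the block runs on its own).

```python
import random

def sample_distinct(k, draw):
    """Collect k distinct items, redrawing whenever a repeat comes up.

    draw is any zero-argument callable returning a hashable item.
    The loop never ends if fewer than k distinct items exist.
    """
    seen = set()      # hash set for the membership test
    result = []       # keeps the items in the order they were accepted
    while len(result) < k:
        item = draw()
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

# Stand-in for a subtree sampler: uniform integers in 0..61.
rng = random.Random(0)
picks = sample_distinct(5, lambda: rng.randrange(62))
```

Keeping a set next to the output list makes each membership check expected constant time while preserving the random order of the output.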

Since $k \leq (1-\epsilon) N$, the expected number of draws needed to find $T' \notin L$ is at most $1/\epsilon$ throughout the process; it will be much smaller in the beginning. In particular, if $k \ll \sqrt{N}$, then it is highly likely that you will never generate the same subtree twice.

In more detail, the expected number of draws per accepted subtree, averaged over the $k$ rounds, is $$ \frac{1}{k} \sum_{\ell=0}^{k-1} \frac{N}{N-\ell} \approx \frac{N}{k} \int_0^k \frac{dx}{N-x} = \frac{N}{k} \ln \frac{N}{N-k} \leq \frac{N}{N-k} = 1 + \frac{k}{N-k}, $$ where the inequality uses $\ln(1+y) \leq y$ with $y = k/(N-k)$.
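The quantities in the chain above can be compared numerically. The following is a small sketch using the example values $N = 62$ and $k = 31$, which are chosen here purely for illustration; the function name `average_draws` is likewise an assumption.

```python
from math import log

def average_draws(N, k):
    """Exact value of (1/k) * sum_{l=0}^{k-1} N / (N - l)."""
    return sum(N / (N - l) for l in range(k)) / k

N, k = 62, 31
exact = average_draws(N, k)             # the sum itself
integral = (N / k) * log(N / (N - k))   # the integral approximation
bound = N / (N - k)                     # the final expression, 1 + k/(N-k)
print(exact, integral, bound)
```

For these values the sum and the integral are close to each other, while $N/(N-k)$ is noticeably larger; it tracks the other two well only when $k$ is small compared to $N$.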

You can implement the check $T' \notin L$ quickly using a hash table. In this way, we have reduced your problem to that of generating a uniformly random subtree, which you can do as follows, essentially by reducing the problem of uniform generation to that of counting.

In order to pick the root of the subtree, first compute $R(T_x)$ for every vertex $x \in T$ (you can do this in $O(n)$ for all vertices together if you're careful, where $n$ is the number of vertices). The root is $x$ with probability $R(T_x)/N$ (you can choose $x$ quickly using binary search, for example).

If $x$ is a leaf, then we're done. Otherwise, suppose that $x$ has children $x_1,\ldots,x_\ell$. Your random subtree skips $x_i$ with probability $1/(1+R(T_{x_i}))$ (independently for each child). If it doesn't skip $x_i$, then you recursively generate a uniformly random subtree of $T_{x_i}$ rooted at $x_i$ and attach it to $x$.
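The root selection and the recursive step can be put together as in the sketch below. The tree representation, the nested-tuple output format `(vertex, kept_children)`, and all function names are assumptions for illustration; the recursion is also assumed to stay within Python's default recursion limit.

```python
import random
from bisect import bisect_right
from itertools import accumulate
from math import prod

# Hypothetical representation: internal vertex -> list of children.
TREE = {"A": ["B", "C"], "B": ["D", "E", "F"], "C": ["G", "H"]}

def rooted_counts(children, root):
    """Return {x: R(T_x)} for every vertex x, in O(n) arithmetic steps."""
    order, stack = [], [root]
    while stack:
        x = stack.pop()
        order.append(x)
        stack.extend(children.get(x, []))
    counts = {}
    for x in reversed(order):  # children before parents
        counts[x] = prod(counts[c] + 1 for c in children.get(x, []))
    return counts

def random_rooted(children, R, x, rng):
    """Uniformly random subtree rooted at x, as (x, tuple_of_child_subtrees)."""
    kept = []
    for c in children.get(x, []):
        # Out of R[c] + 1 equally likely outcomes, exactly one skips c.
        if rng.randrange(R[c] + 1) != 0:
            kept.append(random_rooted(children, R, c, rng))
    return (x, tuple(kept))

def random_subtree(children, root, rng):
    """Uniformly random subtree of the whole tree."""
    R = rooted_counts(children, root)
    vertices = list(R)
    prefix = list(accumulate(R[v] for v in vertices))  # prefix[-1] == N
    # Pick the root x with probability R(T_x) / N by binary search.
    r = rng.randrange(prefix[-1])
    x = vertices[bisect_right(prefix, r)]
    return random_rooted(children, R, x, rng)

rng = random.Random(1)
print(random_subtree(TREE, "A", rng))
```

Because the output is a nested tuple, it is hashable and can be stored directly in a set for the duplicate check. In a real implementation the counts and prefix sums would be computed once rather than on every call.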


Here are two other related approaches. The first is to generate all subtrees, permute them randomly in $O(N)$, and then output the prefix of length $k$.

A variant of the first approach uses unranking. By modifying the approach above, you can take an integer in the range $0,\ldots,N-1$ and convert it to a subtree. This goes as follows. Let $x_1,\ldots,x_n$ be an enumeration of the vertices of $T$. The first $R(T_{x_1})$ integers correspond to subtrees rooted at $x_1$. The following $R(T_{x_2})$ integers correspond to subtrees rooted at $x_2$. And so on.

Now suppose we're given an integer $i$ in the range $0,\ldots,R(T_x)-1$, and need to convert it to a subtree rooted at $x$. If $x$ is a leaf, then there is nothing to do. Otherwise, let $x_1,\ldots,x_\ell$ be the children of $x$. We first convert $i$ into $\ell$ numbers $i_1,\ldots,i_\ell$, where $i_j$ is in the range $0,\ldots,R(T_{x_j})$; since $R(T_x) = \prod_j (R(T_{x_j}) + 1)$, this can be done by writing $i$ in mixed radix with bases $R(T_{x_j}) + 1$. If $i_j = R(T_{x_j})$, then the subtree won't contain $x_j$. Otherwise, we recursively convert $i_j$ into a subtree rooted at $x_j$.
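A minimal Python sketch of this unranking procedure follows. The mixed-radix decoding via `divmod` is one possible way to split $i$ into per-child numbers, and the tree representation, the nested-tuple output, and the function names are all assumptions for illustration.

```python
from math import prod

# Hypothetical representation: internal vertex -> list of children.
TREE = {"A": ["B", "C"], "B": ["D", "E", "F"], "C": ["G", "H"]}

def rooted_counts(children, root):
    """Return {x: R(T_x)} for every vertex x of the tree below root."""
    order, stack = [], [root]
    while stack:
        x = stack.pop()
        order.append(x)
        stack.extend(children.get(x, []))
    counts = {}
    for x in reversed(order):  # children before parents
        counts[x] = prod(counts[c] + 1 for c in children.get(x, []))
    return counts

def unrank_rooted(children, R, x, i):
    """Convert i in 0..R[x]-1 into a subtree rooted at x."""
    kept = []
    for c in children.get(x, []):
        # Peel off one mixed-radix digit in 0..R[c] for child c.
        i, digit = divmod(i, R[c] + 1)
        if digit != R[c]:  # the digit R[c] means "leave c out"
            kept.append(unrank_rooted(children, R, c, digit))
    return (x, tuple(kept))

def unrank(children, R, i):
    """Convert i in 0..N-1 into a subtree, where N = sum(R.values())."""
    for x in R:  # fixed enumeration of the vertices
        if i < R[x]:
            return unrank_rooted(children, R, x, i)
        i -= R[x]
    raise IndexError("rank out of range")

R = rooted_counts(TREE, "A")
N = sum(R.values())
all_subtrees = [unrank(TREE, R, i) for i in range(N)]
```

For large trees the linear scan over vertices in `unrank` could be replaced by a binary search over prefix sums of the counts.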

Given the unranking procedure, you can generate a permutation of $0,\ldots,N-1$, take the prefix of length $k$, and convert it to a list of $k$ subtrees. If you have any other way of generating a random sequence of $k$ elements of $0,\ldots,N-1$ without repetition, then you can use it in the same way.
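One way to obtain $k$ distinct random ranks without building the whole permutation is sketched below. It relies on Python's `random.sample` accepting a `range`; the fallback to redrawing on collisions for very large $N$ is an assumption about how one might handle ranges whose length does not fit in a machine word. The function name `random_distinct_ranks` is hypothetical.

```python
import random

def random_distinct_ranks(N, k, rng):
    """Return k distinct integers from 0..N-1 in random order."""
    if k > N:
        raise ValueError("cannot draw more distinct ranks than exist")
    try:
        # Samples lazily from the range; nothing of size N is materialized.
        return rng.sample(range(N), k)
    except OverflowError:
        # len(range(N)) can overflow for huge N; fall back to redrawing.
        seen, out = set(), []
        while len(out) < k:
            r = rng.randrange(N)
            if r not in seen:
                seen.add(r)
                out.append(r)
        return out

rng = random.Random(0)
ranks = random_distinct_ranks(62, 5, rng)
huge = random_distinct_ranks(10**30, 5, rng)
```

Each returned rank would then be passed through the unranking procedure to obtain the corresponding subtree.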

Yuval Filmus