I pick a random subset $S$ of $\{1,\ldots,N\}$, and you have to guess what it is. After each guess $G$, I tell you the number of elements in $G \cap S$. How many guesses do you need?
-
See http://www.cs.mcgill.ca/~colt2009/papers/004.pdf if you know the cardinality of $S$ (which you can get in one guess); this could give a very good bound on average. – Elaqqad Aug 25 '17 at 20:26
-
Very nice! This answers my question, as it shows that the number of guesses required is $\Theta(N/\log(N))$. – Dave Radcliffe Aug 26 '17 at 01:48
-
@DaveRadcliffe Huh? Didn't mjqxxxx show that $N/\log N$ guesses are required? The importance of the link is that $N/\log N$ guesses are sufficient. – mathworker21 Apr 12 '20 at 19:31
-
This is similar to brute-force solving a true-false test, and the answer I give there applies here. – Mike Earnest Apr 02 '21 at 19:02
-
Back in the day, when you didn't need much detail in your question. – Star Alpha Jul 15 '21 at 10:24
-
My question was clear and complete, and it has been answered completely, for which I am greatly appreciative. – Dave Radcliffe Jul 16 '21 at 14:21
2 Answers
An obvious upper bound is $N$ queries, since you can test each element individually. On the other hand, $\Omega(N/\log N)$ queries are needed: $N$ bits of information are required to identify the target subset, and each query can yield at most $O(\log N)$ bits of information, since each query has only $N+1$ possible answers. To see that the upper bound of $N$ is not sharp, consider the following strategy for $N=5$, which takes at most $4$ queries:
- Guess $\{1,2,3,4,5\}$. If the result is $0$ or $5$, we have the answer. If the result is $1$ or $4$, bisection search (for the single member or the single missing element) gives the answer in three more queries. Suppose the result is $2$ (the strategy for $3$ is the same by symmetry).
- Guess $\{1,2\}$. If the result is $2$, we have the answer. If the result is $0$, then bisection search on $\{3,4,5\}$ (for the single missing element) gives the answer in two more queries. Suppose the result is $1$. Then we know the answer is $\{a,b\}$ for some $a \in \{1,2\}$ and $b \in \{3,4,5\}$.
- Guess $\{1,3\}$. If the result is $2$, we have the answer. If the result is $0$, then the answer is $\{2,b\}$ for some $b\in\{4,5\}$, and one more query gives the answer. Suppose the result is $1$. Then we know the answer is $\{1,4\}$, $\{1,5\}$, or $\{2,3\}$.
- Guess $\{1,4\}$. The answer is $\{1,4\}$ if the result is $2$, or $\{1,5\}$ if the result is $1$, or $\{2,3\}$ if the result is $0$.
This example gives an improved upper bound asymptotic to $4N/5$. It seems likely that the correct answer is strictly $o(N)$ (i.e., eventually less than $cN$ for any fixed $c$), but whether or not it's $\Theta(N/\log N)$, I can't say.
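For small $N$, the optimal worst-case number of queries can be computed outright by game-tree search. Below is a brute-force minimax sketch in Python (the bitmask encoding and the helper names `overlap` and `queries_needed` are mine, added for illustration); "solved" here means the candidate set has been narrowed to a single subset, matching the convention above. For $N=5$ it should report a value of at most $4$, consistent with the strategy above, though the search may take a minute or two:

```python
from functools import lru_cache

N = 5
SUBSETS = range(1 << N)      # all subsets of {0,...,N-1}, encoded as bitmasks
QUERIES = range(1, 1 << N)   # all nonempty queries, also bitmasks

def overlap(q, s):
    """|G ∩ S| for two bitmask-encoded sets."""
    return bin(q & s).count("1")

@lru_cache(maxsize=None)
def queries_needed(candidates):
    """Worst-case number of further queries needed to narrow the given
    frozenset of still-consistent candidates down to a single subset."""
    if len(candidates) <= 1:
        return 0
    if len(candidates) == 2:
        return 1  # query a single element where the two candidates differ
    best = float("inf")
    for q in QUERIES:
        # Partition the candidates by the answer this query would produce.
        parts = {}
        for s in candidates:
            parts.setdefault(overlap(q, s), []).append(s)
        if len(parts) == 1:
            continue  # uninformative query
        worst = 1 + max(queries_needed(frozenset(p)) for p in parts.values())
        if worst < best:
            best = worst
            if best == 1:
                break  # no query can do better than this
    return best

print(queries_needed(frozenset(SUBSETS)))  # expect at most 4 for N = 5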
-
1"each query can yield at most O(logN) bits of information, since each query has only O(N) possible answers". However, since you get to choose what set to query, doesn't that change the amount of information you could theoretically gain? – Mar 06 '11 at 17:19
-
@barrycarter: Here's a longer explanation. Before each query, there are some number $X$ of possible subsets. The result of the query will be $i \in \{0,1,\ldots,N\}$, so the $X$ subsets are partitioned into $N+1$ disjoint classes $X_i$ (where the subsets in $X_i$ are those consistent with the result $i$). The largest of these must have size at least $X/(N+1)$, and in the worst case, that is the result you'll get. So you've reduced the number of possible subsets from $X$ to no less than $X/(N+1)$; i.e., you've gained no more than $\log_{2}(N+1)$ bits of information about the answer. – mjqxxxx Mar 06 '11 at 18:20
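This counting argument can be checked empirically. A minimal Python sketch, added for illustration (the value $N=6$ and the brute-force enumeration are arbitrary choices): whatever query $G$ you pick, grouping all $2^N$ candidate subsets by the answer $|G\cap S|$ always leaves some group of size at least $2^N/(N+1)$.

```python
from itertools import combinations

# For every possible query G, partition all 2^N subsets S by |G ∩ S|
# and confirm the largest class has size at least 2^N / (N + 1).
N = 6
subsets = [frozenset(c) for r in range(N + 1)
           for c in combinations(range(N), r)]
for G in subsets:
    groups = {}
    for S in subsets:
        groups.setdefault(len(G & S), []).append(S)
    assert max(len(g) for g in groups.values()) >= len(subsets) / (N + 1)
print("every query leaves a class of size >= 2^N/(N+1)")
```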
-
@mjqxxxx: we can say more. If your guess has $k$ elements, the central binomial coefficient is about $\frac{2^k}{\sqrt{\frac{\pi}{2}k}}$, so in the worst case you only get $\log_2 \sqrt{\frac{\pi}{2}k}$ bits (taking $\sqrt{\frac{\pi}{2}}\approx 1$). The first guess, where $k=N$, will only get you about $\frac{1}{2}\log_2 N$ bits. I suspect after the first guess you want $k$ to be around $\frac{N}{2}$, so that gets you $\frac{1}{2}\log_2 N - 1$ bits. – Ross Millikan Aug 25 '17 at 03:58
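A quick numerical check of this estimate (an illustrative Python sketch; the sample values of $k$ are arbitrary): for a $k$-element guess against a uniformly random hidden set, the most common answer is $|G\cap S| = k/2$, with probability $\binom{k}{k/2}/2^k \approx \sqrt{2/(\pi k)}$, so the worst-case information gain is about $\log_2\sqrt{\frac{\pi}{2}k}$ bits.

```python
from math import comb, log2, pi, sqrt

# Compare the exact modal-answer probability C(k, k/2) / 2^k with the
# approximation sqrt(2 / (pi k)), and print the worst-case bits gained.
for k in [10, 100, 1000, 10000]:
    p_modal = comb(k, k // 2) / 2 ** k
    print(k, p_modal, sqrt(2 / (pi * k)), log2(1 / p_modal))
```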
-
@RossMillikan I must confess I don't know much about these things, but find them rather interesting nonetheless. Can you suggest any readings on the topic for me? – Fimpellizzeri Aug 25 '17 at 04:33
-
@Fimpellizieri: I am not sure what you mean by "these things". The number of bits necessary to decide something is basic to information theory. A bit can give you the result of a yes/no question. Picking one item out of $N$ requires $\log_2 N$ bits because you can split the group in half and ask which half it is in $\log_2 N$ times and find it. One way to put a bound on the number of questions you need to ask is to count the bits needed and the number you get from each question, which is what we are doing here. – Ross Millikan Aug 25 '17 at 04:56
-
@RossMillikan I guess I mean information theory. I'd be interested to read more on it. – Fimpellizzeri Aug 25 '17 at 05:10
-
@RossMillikan "Picking one item out of requires log2 bits because you can split the group in half and ask which half it is in log2 times and find it." Huh? Doesn't that just show $\log_2 N$ is sufficient? – mathworker21 Nov 25 '19 at 21:09
-
@mathworker21: it would show it is sufficient if you could always split the group in half. There may be structure to the problem that makes it impossible to split the group in half because of correlations between the splits you have done and the splits you want to do. – Ross Millikan Sep 19 '23 at 03:18
-
@RossMillikan Yeah, my point is that the argument shows (roughly) $\log_2 N$ bits are sufficient, not that $\log_2 N$ bits are required. – mathworker21 Sep 19 '23 at 14:51
This can be solved in $\Theta(N/\log N)$ queries. First, here is a lemma:
Lemma: If you can solve $N$ in $Q$ non-adaptive queries (fixed in advance), where one of the queries is the entire set $\{1,\dots,N\}$, then you can solve $2N+Q-1$ in $2Q$ non-adaptive queries, where one of the queries is the entire set.
Proof: Divide $\{1,\dots,2N+Q-1\}$ into three sets, $A,B$ and $C$, where $|A|=|B|=N$ and $|C|=Q-1$. By assumption, there exist subsets $A_1,\dots,A_{Q-1}$ such that you could find the unknown subset of $A$ alone by first guessing $A$, then guessing $A_1,\dots,A_{Q-1}$. Similarly, there exist subsets $B_1,\dots,B_{Q-1}$ for solving $B$. Finally, write $C=\{c_1,\dots,c_{Q-1}\}$.
The winning strategy is:
- Guess the entire set, $\{1,\dots,2N+Q-1\}$.
- Guess $B$.
- For each $i\in \{1,\dots,Q-1\}$, guess $A_i\cup B_i$.
- For each $i\in \{1,\dots,Q-1\}$, guess $A_i\cup (B\setminus B_i)\cup \{c_i\}$.
Using the parity of the sum of the results of the guesses $A_i\cup B_i$ and $A_i\cup (B\setminus B_i)\cup \{c_i\}$, you can determine whether or not $c_i\in S$ (the computation below makes this explicit). Then, using these same guesses, you get a system of equations which lets you solve for $|A_i \cap S|$ and $|B_i\cap S|$ for all $i$. This gives you enough info to determine $A\cap S$ and $B\cap S$, using the assumed strategy.$\tag*{$\square$}$
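To spell out the bookkeeping (a short added derivation; the symbols $x_i,y_i,\beta,\epsilon_i$ are names introduced here for convenience): let $x_i=|A_i\cap S|$, $y_i=|B_i\cap S|$, $\beta=|B\cap S|$ (known from the guess of $B$), and $\epsilon_i=1$ if $c_i\in S$ and $0$ otherwise. The two guesses indexed by $i$ return $$ g_i = x_i + y_i, \qquad h_i = x_i + (\beta - y_i) + \epsilon_i, $$ so that $g_i + h_i = 2x_i + \beta + \epsilon_i$. Since $\beta$ is known, the parity of $g_i + h_i - \beta$ reveals $\epsilon_i$, and then $$ x_i = \frac{g_i + h_i - \beta - \epsilon_i}{2}, \qquad y_i = g_i - x_i. $$ Together with the first guess, whose result is $|A\cap S|+\beta+\sum_i\epsilon_i$, this also yields $|A\cap S|$, so the answers that the assumed strategies for $A$ and for $B$ would have received are all available.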
Let $\def\Opt{\operatorname{Opt}}\Opt(N)$ be the minimum number of guesses you need for $\{1,\dots,N\}$. Using the lemma and induction, you can show that $$ \Opt(k2^{k-1}+1)\le 2^k\qquad \text{for all }k\in \{0,1,2,\dots\}. $$ More specifically, you can prove
Theorem Whenever $n,k$ are integers such that $n\le (k+1)2^{k}+1$, we have $\Opt(n)\le 2^{k+1}$.
Proof: We prove this by induction on $k$. The base case $k=0$ asks for $\Opt(n)\le 2$ whenever $n\le 2$, which is clear (guess the whole set, then $\{1\}$). For the inductive step, we assume the theorem holds with $k-1$ in place of $k$, i.e., $\Opt(n')\le 2^{k}$ whenever $n'\le k2^{k-1}+1$, and prove that it holds for $k$. (The induction in fact produces strategies of the form required by the Lemma: the queries are fixed in advance, and one of them is the entire set.)
Given $n,k$ with $n\le (k+1)2^k+1$, let $$ n' = \left\lceil (n-2^k+1)/2 \right\rceil. $$ Combining this definition with $n\le (k+1)2^k+1$, we get $$ n'\le \left\lceil \big((k+1)2^k+1-2^k+1\big)/2 \right\rceil = k2^{k-1}+1, $$ so we may apply the induction hypothesis to $n'$ and conclude $\Opt(n')\le 2^k$. Finally, $2n'+2^k-1\ge n$, and a strategy for a larger ground set restricts to any smaller one (replace each query $Q$ by $Q\cap\{1,\dots,n\}$), so the Lemma gives $$ \Opt(n)\le\Opt(2n' + 2^k-1)\le 2\cdot 2^k=2^{k+1}. $$ This completes the proof by induction. $$\tag*{$\square$}$$
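Both the displayed bound $\Opt(k2^{k-1}+1)\le 2^k$ and the inequalities in the inductive step can be sanity-checked numerically. A small Python sketch, added for illustration and not part of the proof:

```python
# 1) Iterate the Lemma's recurrence N -> 2N + Q - 1, Q -> 2Q, starting
#    from N = 1, Q = 1 (one element: guessing the whole set suffices).
#    This reproduces Opt(k*2^(k-1) + 1) <= 2^k.
N, Q = 1, 1
for k in range(1, 10):
    N, Q = 2 * N + Q - 1, 2 * Q
    assert N == k * 2 ** (k - 1) + 1 and Q == 2 ** k

# 2) Check the inductive step: for every n in the range
#    2^k <= n <= (k+1)*2^k + 1 (smaller n are handled by monotonicity),
#    n' = ceil((n - 2^k + 1)/2) satisfies the induction hypothesis's
#    bound, and the Lemma instance 2n' + 2^k - 1 covers n.
for k in range(1, 12):
    for n in range(2 ** k, (k + 1) * 2 ** k + 2):
        n_prime = (n - 2 ** k + 2) // 2   # = ceil((n - 2^k + 1) / 2)
        assert n_prime <= k * 2 ** (k - 1) + 1
        assert 2 * n_prime + 2 ** k - 1 >= n

print("recurrence and inductive step check out")
```

Note that $k=2$ in the first loop gives $N=5$ in $Q=4$ queries, matching the explicit strategy in the other answer.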
Finally, we conclude
Corollary $\Opt(N)\in O(N/\log N)$.
Proof: Think of $k$ as a function of $N$, where $k$ is the unique integer for which $$ k2^{k-1}+1<N\le (k+1)2^k+1. $$ These two inequalities imply that $k\sim\log_2 N$, where $f(N)\sim g(N)$ means that $\lim_{N\to\infty} g(N)/f(N)=1$. Therefore, by the Theorem, $$ \Opt(N)\le 2^{k+1} = 4\cdot\frac{k2^{k-1}}{k}\sim 4\cdot \frac{k2^{k-1}}{\log_2 N}\le 4\cdot \frac{N}{\log_2 N}.\tag*{$\square$} $$
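To see the constant $4$ concretely, here is an illustrative Python sketch (the helper name `upper_bound` and the sample values of $N$ are mine): pick the $k$ determined by $N$ as in the proof and compare the resulting bound $2^{k+1}$ with $4N/\log_2 N$.

```python
from math import log2

def upper_bound(N):
    """Smallest k with N <= (k+1)*2^k + 1; the Theorem then
    gives Opt(N) <= 2^(k+1)."""
    k = 0
    while (k + 1) * 2 ** k + 1 < N:
        k += 1
    return 2 ** (k + 1)

for N in [10, 100, 10_000, 10 ** 6]:
    print(N, upper_bound(N), round(4 * N / log2(N)))
```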
The entropy argument in the other answer, refined as in the comments (each query yields at most about $\tfrac12\log_2 N$ bits), already proved the matching lower bound, so we can safely say that $\Opt(N)=\Theta(N/\log N)$. More specifically, $$ 2 \frac{N}{\log_2 N} \lesssim \Opt(N)\lesssim 4\frac{N}{\log_2 N}, $$ where $f(N)\lesssim g(N)$ means that for every $\epsilon>0$ we have $f(N)\le (1+\epsilon) g(N)$ for all sufficiently large $N$ (depending on $\epsilon$).