
Not long ago I asked for an algorithm that finds the word minimizing the sum of squared Hamming distances to all words in a data set. The answer was that this minimization problem is NP-hard, though there are approximation algorithms one could use. This question expands on that one by considering a continuous analogue of the Hamming distance. The original problem was an optimization over a discrete but potentially large set; sometimes a continuous version of a minimization problem is easier, so I want to explore it.

Let $f_1:[0,1]\to\{0,1\}$ and $f_2:[0,1]\to\{0,1\}$ be two Lebesgue-integrable functions; we'll even assume they're continuous almost everywhere. The continuous version of the Hamming distance is then $d(f_1, f_2)=\int_0^1 \left|f_1(t)-f_2(t)\right|dt$; this is the $L_1$ norm of the difference of the two functions, and since the integrand takes only the values $0$ and $1$, the $L_p$ norm for any $p>1$ is simply the $L_1$ norm raised to the power $1/p$. Given a data set $X_1, \ldots, X_n$ of such functions, a Fréchet mean is a function $\hat{\mu}:[0,1]\to\{0,1\}$ that minimizes $$\sum_{i=1}^{n}d^2(X_i,\hat{\mu}).$$
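To make the distance concrete, here is a minimal numerical sketch (all names are illustrative, not from any library) that approximates $d(f_1,f_2)=\int_0^1|f_1(t)-f_2(t)|\,dt$ by a midpoint Riemann sum for two $\{0,1\}$-valued step functions:

```python
def d(f1, f2, n=10_000):
    """Midpoint Riemann-sum approximation of the L1 distance on [0, 1]."""
    return sum(abs(f1((k + 0.5) / n) - f2((k + 0.5) / n)) for k in range(n)) / n

# Two indicator step functions on [0, 1]:
f1 = lambda t: 1 if t < 0.5 else 0
f2 = lambda t: 1 if t < 0.25 else 0

print(d(f1, f2))  # 0.25: the measure of the set where they differ
```

Because the functions are $\{0,1\}$-valued, the distance is just the measure of the set on which they disagree, which is what makes this a natural analogue of the Hamming distance.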

Is finding this mean function computationally feasible? I don't believe it is, but I'm not a computer scientist (I'm a mathematician), so I don't trust my judgement completely. I think NP-hardness could be shown as follows: if you had an algorithm that solved this problem in polynomial time, you would also have a polynomial-time algorithm for the discrete analogue; simply translate the discrete words into step functions that are constant everywhere except at jump points that are multiples of $1/m$, where $m$ is the length of the word. Since the discrete problem is NP-hard, that would make the continuous analogue NP-hard as well. I do not know whether this is a valid inference about algorithmic running time, though.
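The embedding described above can be sketched as follows (a hypothetical illustration; the function names are mine): a length-$m$ binary word becomes a step function that is constant on each interval $((j-1)/m,\, j/m)$, and the continuous $L_1$ distance between two such functions is the Hamming distance scaled by $1/m$, so minimizing one objective minimizes the other.

```python
def word_to_function(word):
    """Embed a binary word as a step function on [0, 1]."""
    m = len(word)
    return lambda t: word[min(int(t * m), m - 1)]

def l1_distance(f1, f2, n=10_000):
    """Midpoint Riemann-sum approximation of the L1 distance on [0, 1]."""
    return sum(abs(f1((k + 0.5) / n) - f2((k + 0.5) / n)) for k in range(n)) / n

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

u, v = [1, 0, 1, 1], [1, 1, 0, 1]
f, g = word_to_function(u), word_to_function(v)

# The continuous distance is the Hamming distance divided by m:
print(l1_distance(f, g), hamming(u, v) / len(u))  # both 0.5
```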

If I am correct, and this problem is also NP-hard, what do we gain (if anything) in terms of heuristics by viewing the problem as continuous rather than discrete? Is the heuristic of using a sample minimizer the only one we have, or do we gain others?

cgmil

1 Answer


I have a suspicion (but no proof) that the continuous version of the problem might be tractable.

Let's assume that each function $f \in X$ has the form $f(x)=c_i$ for $d_i < x < d_{i+1}$, for some $d_i$'s satisfying $0=d_1 < d_2< \dots < d_n=1$ (the same $d_i$'s for every $f$). If each function in $X$ has only finitely many discontinuities, this assumption is without loss of generality: take the $d_i$'s to be the points of discontinuity of the functions in $X$, together with $0$ and $1$.

Define the vector $\mu$ by $$\mu_i = {1 \over |X|} \sum_{f \in X} f(x)$$ for some $x \in (d_i,d_{i+1})$ (the particular $x$ doesn't matter).

Consider the function $\eta$ given by $\eta(x) = \mu_i$ if $d_i < x < d_{i+1}$. I have a suspicion this function might be a global minimizer of your objective function over all real-valued functions. However, it is not admissible, because its range is not contained in $\{0,1\}$. So we will instead define a function $\hat{\mu}$ that (intuitively) behaves much like $\eta$ for the purposes of this problem, except its range is $\{0,1\}$.

Specifically, define the function $\hat{\mu}$ so that $\hat{\mu}(x) = 1$ if $d_i < x < d_i + \mu_i (d_{i+1}-d_i)$ and $\hat{\mu}(x) = 0$ if $d_i + \mu_i (d_{i+1}-d_i) < x < d_{i+1}$. Notice that $\int_{x=d_i}^{x=d_{i+1}} \hat{\mu}(x) \; dx = \mu_i (d_{i+1}-d_i)$.

Now I have a suspicion that this function $\hat{\mu}$ might be a global minimizer for your problem. Or if not, it might be a good heuristic.
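The construction above can be sketched in a few lines, under the stated assumption that every function in $X$ is constant on the common intervals $(d_i, d_{i+1})$. Each function is represented by its tuple of interval values; all names are illustrative. Since $\hat{\mu}$ equals $1$ on the first $\mu_i$ fraction of interval $i$ and $0$ on the rest, it disagrees with a function of constant value $f_i$ on that interval on a set of measure $|f_i - \mu_i|\,(d_{i+1}-d_i)$, so the distances can be computed exactly:

```python
breakpoints = [0.0, 0.25, 0.5, 1.0]               # the common d_i's (example)
lengths = [b - a for a, b in zip(breakpoints, breakpoints[1:])]

X = [(1, 0, 1), (1, 1, 0), (0, 1, 1)]             # a toy data set of step functions

# mu_i: average value of the data on the i-th interval
mu = [sum(f[i] for f in X) / len(X) for i in range(len(lengths))]

def distance_to_mu_hat(f):
    # On interval i, mu_hat is 1 on the first mu_i fraction and 0 on the rest,
    # so it differs from the constant f[i] on a set of measure
    # |f[i] - mu_i| * (d_{i+1} - d_i).
    return sum(abs(f[i] - mu[i]) * lengths[i] for i in range(len(lengths)))

# The objective: sum of squared distances from mu_hat to the data
obj = sum(distance_to_mu_hat(f) ** 2 for f in X)
print(mu, obj)  # mu = [2/3, 2/3, 2/3], objective ≈ 0.597
```

Comparing `obj` against the objective value of other candidate $\{0,1\}$-valued step functions on the same breakpoints would be one way to probe whether $\hat{\mu}$ really is a minimizer.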

I have no proof, so this could be totally wrong. I suggest trying it out for yourself to see whether it seems to be correct or not.

D.W.